
Tutorial Blast (NCBI)

 Object: Starting with an organism and a protein, find a protein sequence and gene coding region. 

Example: Find the protein sequence and gene coding region for pathogenicity factor listeriolysin O from the bacterium Listeria monocytogenes. 

 

Searching for gene and protein information 

Begin the search in Gene, because it has less redundancy than Protein (this same search in Protein retrieves over 1000 records). 

Search: Listeria monocytogenes[organism] AND listeriolysin O[protein name] 

Gene search 

One record, for gene symbol hly, is retrieved. It is associated with an NC_ accession number (specifying a complete genomic molecule that is usually a reference assembly; see RefSeq accession numbers and molecule types). 
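For readers who prefer scripting, the same query can be written as a request to NCBI's E-utilities ESearch endpoint. The sketch below only builds the request URL and does not send it. The helper name build_gene_search_url is hypothetical, and the field tags are copied from the search above on the assumption that ESearch accepts them the same way the web interface does.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of NCBI E-utilities
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(organism, protein_name, db="gene"):
    """Build (but do not send) an ESearch URL for the fielded query used above."""
    # Field tags are taken verbatim from the tutorial's search string
    term = f"{organism}[organism] AND {protein_name}[protein name]"
    params = {"db": db, "term": term, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


url = build_gene_search_url("Listeria monocytogenes", "listeriolysin O")
print(url)
```

Passing db="protein" instead would correspond to running the same search in Protein, which, as noted above, returns far more records.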

To find the gene coding sequence, look at the Genomic regions, transcripts, and products section or the NCBI Reference Sequences (RefSeq) section of the Gene record: 

 

  

Clicking on the GenBank link displays the GenBank record in the Nucleotide database. The coding sequence for the gene hly can be found under CDS in the Features section of the record (outlined in red): 

 

The GenBank record for this gene also shows its location on the chromosome and the translated protein sequence (outlined in blue). The protein sequence can also be found by clicking on the protein accession number in the Nucleotide record or in the RefSeq section of the Gene record. 
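The CDS feature gives the gene's position as a location such as start..end, or complement(start..end) when the gene lies on the reverse strand. As a rough illustration of how such a location maps to bases, the sketch below cuts a coding region out of a sequence. The function names and the toy sequence are made up for this example, and only these two simple location forms are handled (no joins or partial locations).

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_cds(genome, location):
    """Extract bases for 'start..end' or 'complement(start..end)'."""
    match = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)|(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location}")
    on_reverse = match.group(1) is not None
    start, end = (int(g) for g in match.groups() if g is not None)
    # GenBank coordinates are 1-based and inclusive
    segment = genome[start - 1:end]
    return reverse_complement(segment) if on_reverse else segment


toy_genome = "AAATGGCCTAAGG"  # made-up sequence, not from any real record
forward_cds = extract_cds(toy_genome, "3..11")
reverse_cds = extract_cds(toy_genome, "complement(3..11)")
print(forward_cds)
print(reverse_cds)
```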

Sample GenBank record 


The Basic Local Alignment Search Tool (BLAST) finds regions of similarity between sequences. The program compares nucleotide or protein sequences and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families. 

 

There are several types of BLAST searches. NCBI's WebBLAST offers four main search types: 

  • BLASTn (Nucleotide BLAST): compares one or more nucleotide query sequences to a subject nucleotide sequence or a database of nucleotide sequences. This is useful when trying to determine the evolutionary relationships among different organisms (see Comparing two or more sequences below). 

  • BLASTx (translated nucleotide sequence searched against protein sequences): compares a nucleotide query sequence that is translated in six reading frames (resulting in six protein sequences) against a database of protein sequences. Because blastx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or it contains errors that may lead to frame shifts or other coding errors. Thus blastx is often the first analysis performed with a newly determined nucleotide sequence. 

  • tBLASTn (protein sequence searched against translated nucleotide sequences): compares a protein query sequence against the six-frame translations of a database of nucleotide sequences. Tblastn is useful for finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG), located in the BLAST databases est and htgs, respectively. ESTs are short, single-read cDNA sequences. They comprise the largest pool of sequence data for many organisms and contain portions of transcripts from many uncharacterized genes. Since ESTs have no annotated coding sequences, there are no corresponding protein translations in the BLAST protein databases. Hence a tblastn search is the only way to search for these potential coding regions at the protein level. The HTG sequences, draft sequences from various genome projects or large genomic clones, are another large source of unannotated coding regions. 

  • BLASTp (Protein BLAST): compares one or more protein query sequences to a subject protein sequence or a database of protein sequences. This is useful when trying to identify a protein (see From sequence to protein and gene below). 
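To make the idea of six reading frames concrete, here is a minimal sketch of six-frame translation, the step that BLASTx applies to the query and tBLASTn applies to the database. It assumes the standard genetic code and uses a made-up toy sequence; it illustrates the concept and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


toy_query = "ATGGCCTAAGG"  # made-up sequence for illustration only
frames = six_frame_translations(toy_query)
for name, protein in frames.items():
    print(name, protein)
```

Each of the six strings is a candidate protein; a translated search compares all of them, which is why the reading frame of the query does not need to be known in advance.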

There are also standalone and API BLAST options as well as pre-populated specialized searches available on the BLAST homepage linked above. 


From sequence to protein and gene 

Object: Starting with a sequence, identify the protein or gene and the source. 

Example: From the following sequence (available at http://tinyurl.com/blastp-sequence, or copy the sequence below), identify the most probable protein and organism: 

MSKRKAPQET LNGGITDMLT ELANFEKNVS QAIHKYNAYR KAASVIAKYP HKIKSGAEAK 
KLPGVGTKIA EKIDEFLATG KLRKLEKIRQ DDTSSSINFL TRVSGIGPSA ARKFVDEGIK 
TLEDLRKNED KLNHHQRIGL KYFGDFEKRI PREEMLQMQD IVLNEVKKVD SEYIATVCGS 
FRRGAESSGD MDVLLTHPSF TSESTKQPKL LHQVVEQLQK VHFITDTLSK GETKFMGVCQ 
LPSKNDEKEY PHRRIDIRLI PKDQYYCGVL YFTGSDIFNK NMRAHALEKG FTINEYTIRP 
LGVTGVAGEP LPVDSEKDIF DYIQWKYREP KDRSE 
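The sequence above is shown in blocks of ten residues separated by spaces. The BLAST search box generally tolerates this layout, but other tools may not, so it can help to convert the sequence to plain FASTA first. The sketch below does that; the header text unknown_query and the function names are arbitrary choices for this example.

```python
RAW_QUERY = """
MSKRKAPQET LNGGITDMLT ELANFEKNVS QAIHKYNAYR KAASVIAKYP HKIKSGAEAK
KLPGVGTKIA EKIDEFLATG KLRKLEKIRQ DDTSSSINFL TRVSGIGPSA ARKFVDEGIK
TLEDLRKNED KLNHHQRIGL KYFGDFEKRI PREEMLQMQD IVLNEVKKVD SEYIATVCGS
FRRGAESSGD MDVLLTHPSF TSESTKQPKL LHQVVEQLQK VHFITDTLSK GETKFMGVCQ
LPSKNDEKEY PHRRIDIRLI PKDQYYCGVL YFTGSDIFNK NMRAHALEKG FTINEYTIRP
LGVTGVAGEP LPVDSEKDIF DYIQWKYREP KDRSE
"""


def clean_sequence(raw):
    """Remove all whitespace (spaces and line breaks) from a pasted sequence."""
    return "".join(raw.split())


def to_fasta(raw, header, width=60):
    """Format a pasted sequence as FASTA with fixed-width lines."""
    residues = clean_sequence(raw)
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return f">{header}\n" + "\n".join(lines)


residues = clean_sequence(RAW_QUERY)
fasta = to_fasta(RAW_QUERY, "unknown_query")
print(len(residues), "residues")
print(fasta)
```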
  

 

Querying a sequence 

Protein and gene sequence comparisons are done with BLAST (Basic Local Alignment Search Tool). 

To access BLAST, go to Sequence Analysis > Tools > BLAST: 

ncbi_homepage_BLAST 

This is an unknown protein sequence that we are seeking to identify by comparing it to known protein sequences, and so Protein BLAST should be selected from the BLAST menu: 

Protein BLAST 

Enter the query sequence in the search box, provide a job title, choose a database to query, and click BLAST: 
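The same submission can also be made without the web form through the BLAST URL API mentioned earlier. The sketch below only assembles the request parameters and does not contact NCBI. It assumes the commonly documented parameters CMD, PROGRAM, DATABASE, QUERY, RID and FORMAT_TYPE; the function names, the short query fragment, and the placeholder request ID are hypothetical, and the job title field of the web form is left out.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_params(query, program="blastp", database="nr"):
    """Parameters for submitting a search (CMD=Put); to be sent as a POST body."""
    return {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}


def build_results_url(rid, format_type="Text"):
    """URL for fetching results (CMD=Get) once the server has issued a request ID."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return f"{BLAST_URL}?{urlencode(params)}"


# First ten residues of the example query, used only to keep the output short
submit_params = build_submit_params("MSKRKAPQET")
post_body = urlencode(submit_params)
results_url = build_results_url("EXAMPLE_RID")  # placeholder, not a real request ID
print(post_body)
print(results_url)
```

In practice a script would submit the POST body, read the request ID from the response, and then poll the results URL at a polite interval until the search has finished.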

 

 

Viewing your results 

Under the Alignments tab, next to Alignment view, select Pairwise with dots for identities.

 

  

View the Descriptions tab to see a list of significant alignments. Note that the first match is a synthetic construct (that is, an artificially engineered sequence rather than one taken from a naturally occurring organism):

BLASTp description table 

Key for default display: 

  • Max[imum] Score: the highest alignment score, calculated from the sum of the rewards for matched nucleotides or amino acids and the penalties for mismatches and gaps.

  • Total Score: the sum of alignment scores of all segments from the same subject sequence. 

  • Query Cover[age]: the percent of the query length that is included in the aligned segments. 

  • E[xpect] Value: the number of alignments expected by chance with the calculated score or better. The expect value is the default sorting metric; for significant alignments the E value should be very close to zero.

  • Ident[ity]: the highest percent identity for a set of aligned segments to the same subject sequence. 

  • Acc[ession] Len[gth]: the number of nucleotides or amino acids in the result sequence identified by the accession number.

  • Accession [number]: a unique identifier assigned to records in the NCBI databases.
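Two of these columns are simple ratios and can be worked out by hand. The sketch below computes percent identity and query coverage for a single aligned segment, using a made-up ten-column alignment and an assumed query length of 40. It is a simplification: when a subject has several aligned segments, BLAST combines them before reporting these values.

```python
def percent_identity(aligned_query, aligned_subject):
    """Identical columns as a percentage of the alignment length (gaps included)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / len(aligned_query)


def query_coverage(query_start, query_end, query_length):
    """Percentage of the query covered by one segment (1-based, inclusive)."""
    return 100.0 * (query_end - query_start + 1) / query_length


# Toy alignment: one substitution (K to R) and one gap in the subject
toy_query = "MSKRKAPQET"
toy_subject = "MSKRRAPQ-T"
identity = percent_identity(toy_query, toy_subject)
coverage = query_coverage(1, 10, 40)
print(identity, coverage)
```

Here eight of the ten alignment columns are identical, and the segment spans ten of the forty query residues.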

Clicking on a protein name displays the pairwise sequence alignment and links to additional information about the protein and its associated gene (if available). 

BLASTp result display pairwise with dots for alignment 

  

For the pairwise with dots for identities display, any differing amino acid in the subject sequence will be displayed in red: 

BLASTp result showing misalignment 
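The dots display is easy to reproduce for a pair of aligned sequences: every subject residue that matches the query is replaced by a dot, so only the differences remain visible. A minimal sketch, using a made-up subject that differs from the start of the example query at one position:

```python
def dots_for_identities(aligned_query, aligned_subject):
    """Render the subject row with '.' wherever it is identical to the query."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "." if q == s else s for q, s in zip(aligned_query, aligned_subject)
    )


query_row = "MSKRKAPQET"
subject_row = "MSKRRAPQET"  # made-up subject with one substitution
dotted = dots_for_identities(query_row, subject_row)
print("Query  ", query_row)
print("Sbjct  ", dotted)
```

The web display additionally colors the differing residues red; this sketch only reproduces the dots.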

 

Saving your results 

To save your search queries and settings, click on the Save Search link, then log in to My NCBI using the Sign in or Register link at the upper right. Once you do this, your search strategies should appear in the Saved Search Strategies tab. 

 

 
