BLAST Tutorial (NCBI)

 Object: Starting with an organism and a protein, find a protein sequence and gene coding region. 

Example: Find the protein sequence and gene coding region for the pathogenicity factor listeriolysin O from the bacterium Listeria monocytogenes. 

 

Searching for gene and protein information 

Begin the search in Gene, because it has less redundancy than Protein (this same search in Protein retrieves over 1000 records). 

Search: Listeria monocytogenes[organism] AND listeriolysin O[protein name] 
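The same fielded query can also be expressed as a URL for NCBI's E-utilities ESearch service. The following is a minimal sketch that only builds the URL and does not send it; the helper name build_gene_search_url is hypothetical, and the choice of retmode=json is an assumption made for illustration.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(organism: str, protein_name: str) -> str:
    """Return an ESearch URL for a fielded query against the Gene database.

    The field tags mirror the search string used in this tutorial.
    """
    term = f"{organism}[organism] AND {protein_name}[protein name]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Build (but do not send) the query used in the example above.
url = build_gene_search_url("Listeria monocytogenes", "listeriolysin O")
print(url)
```

Changing db to "protein" would run the same query against the Protein database, which is the more redundant search mentioned above.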

Gene search 

One record, for gene symbol hly, is retrieved. It is associated with an NC_ accession number (specifying a complete genomic molecule that is usually a reference assembly; see RefSeq accession numbers and molecule types). 
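RefSeq accession prefixes indicate the molecule type, which is why the NC_ prefix identifies a complete genomic molecule. As a rough sketch, a lookup table can decode the most common prefixes; the table below is a partial list, and the accession NC_000000.0 is a made-up placeholder, not a real record.

```python
# Partial table of RefSeq accession prefixes and the molecule types they denote.
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule",
    "NG_": "incomplete genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA model",
    "XP_": "predicted protein model",
    "WP_": "non-redundant protein",
}


def describe_accession(accession: str) -> str:
    """Describe a RefSeq accession from its three-character prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown or non-RefSeq accession")


# NC_000000.0 is a hypothetical placeholder accession used only for illustration.
print(describe_accession("NC_000000.0"))
```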

To find the gene coding sequence, look at the Genomic regions, transcripts, and products section or the NCBI Reference Sequences (RefSeq) section of the Gene record: 

 

  

Clicking on the GenBank link displays the GenBank record in the Nucleotide database. The coding sequence for the gene hly can be found under CDS in the Features section of the record (outlined in red): 

 

The GenBank record for this gene also shows its location on the chromosome and the translated protein sequence (outlined in blue). The protein sequence can also be found by clicking on the protein accession number in the Nucleotide record or in the RefSeq section of the Gene record. 
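The CDS feature gives the gene's location as 1-based, inclusive coordinates, wrapped in complement(...) when the gene lies on the reverse strand. The sketch below shows how such a location maps onto a sequence. It handles only simple start..end locations, and the toy genome strings are invented for illustration, not taken from the hly record.

```python
import re

# Matches "start..end" or "complement(start..end)" locations only.
LOCATION = re.compile(r"(?:complement\((\d+)\.\.(\d+)\)|(\d+)\.\.(\d+))")
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_cds(genome: str, location: str) -> str:
    """Extract a coding sequence from a simple GenBank-style location string."""
    match = LOCATION.fullmatch(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    on_reverse_strand = match.group(1) is not None
    if on_reverse_strand:
        start, end = int(match.group(1)), int(match.group(2))
    else:
        start, end = int(match.group(3)), int(match.group(4))
    # GenBank coordinates are 1-based and inclusive; Python slices are 0-based.
    region = genome[start - 1:end]
    return reverse_complement(region) if on_reverse_strand else region


# Toy sequences (not real data): the same small gene on each strand.
forward_genome = "CCATGAAATAGCC"
reverse_genome = reverse_complement(forward_genome)
print(extract_cds(forward_genome, "3..11"))
print(extract_cds(reverse_genome, "complement(3..11)"))
```

Real records can also contain joined or partial locations, which this sketch deliberately rejects.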

Sample GenBank record 


The Basic Local Alignment Search Tool (BLAST) finds regions of similarity between sequences. The program compares nucleotide or protein sequences and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families. 

 

There are several types of BLAST searches. NCBI's WebBLAST offers four main search types: 

  • BLASTn (Nucleotide BLAST): compares one or more nucleotide query sequences to a subject nucleotide sequence or a database of nucleotide sequences. This is useful when trying to determine the evolutionary relationships among different organisms (see Comparing two or more sequences below). 

  • BLASTx (translated nucleotide sequence searched against protein sequences): compares a nucleotide query sequence that is translated in six reading frames (resulting in six protein sequences) against a database of protein sequences. Because BLASTx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or when the sequence contains errors that may lead to frameshifts or other coding errors. BLASTx is therefore often the first analysis performed on a newly determined nucleotide sequence. 

  • tBLASTn (protein sequence searched against translated nucleotide sequences): compares a protein query sequence against the six-frame translations of a database of nucleotide sequences. tBLASTn is useful for finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG), located in the BLAST databases est and htgs, respectively. ESTs are short, single-read cDNA sequences. They comprise the largest pool of sequence data for many organisms and contain portions of transcripts from many uncharacterized genes. Since ESTs have no annotated coding sequences, there are no corresponding protein translations in the BLAST protein databases. Hence a tBLASTn search is the only way to search for these potential coding regions at the protein level. The HTG sequences, which are draft sequences from various genome projects or large genomic clones, are another large source of unannotated coding regions. 

  • BLASTp (Protein BLAST): compares one or more protein query sequences to a subject protein sequence or a database of protein sequences. This is useful when trying to identify a protein (see From sequence to protein and gene below). 
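The six-frame translation that BLASTx applies to the query, and tBLASTn to the database, can be illustrated with a short sketch. It assumes the standard genetic code and a toy DNA string. It illustrates the idea only and is not BLAST's own implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate DNA codon by codon; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


# Toy query (not real data); '*' marks a stop codon.
frames = six_frame_translation("ATGAAATAG")
for frame in (1, 2, 3, -1, -2, -3):
    print(f"{frame:+d}: {frames[frame]}")
```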

There are also standalone and API BLAST options, as well as pre-populated specialized searches, available on the BLAST homepage. 


From sequence to protein and gene 

Object: Starting with a sequence, identify the protein or gene and the source. 

Example: From the following sequence (available at http://tinyurl.com/blastp-sequence, or copy the sequence below), identify the most probable protein and organism: 

MSKRKAPQET LNGGITDMLT ELANFEKNVS QAIHKYNAYR KAASVIAKYP HKIKSGAEAK 
KLPGVGTKIA EKIDEFLATG KLRKLEKIRQ DDTSSSINFL TRVSGIGPSA ARKFVDEGIK 
TLEDLRKNED KLNHHQRIGL KYFGDFEKRI PREEMLQMQD IVLNEVKKVD SEYIATVCGS 
FRRGAESSGD MDVLLTHPSF TSESTKQPKL LHQVVEQLQK VHFITDTLSK GETKFMGVCQ 
LPSKNDEKEY PHRRIDIRLI PKDQYYCGVL YFTGSDIFNK NMRAHALEKG FTINEYTIRP 
LGVTGVAGEP LPVDSEKDIF DYIQWKYREP KDRSE 
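The sequence above is printed in blocks of ten residues. Before submitting it programmatically, it can help to strip the spacing and wrap the residues as a FASTA record. A minimal sketch follows; the record name unknown_protein and the 60-character line width are arbitrary choices made for illustration.

```python
# The example sequence exactly as printed above, in blocks of ten residues.
RAW_SEQUENCE = """
MSKRKAPQET LNGGITDMLT ELANFEKNVS QAIHKYNAYR KAASVIAKYP HKIKSGAEAK
KLPGVGTKIA EKIDEFLATG KLRKLEKIRQ DDTSSSINFL TRVSGIGPSA ARKFVDEGIK
TLEDLRKNED KLNHHQRIGL KYFGDFEKRI PREEMLQMQD IVLNEVKKVD SEYIATVCGS
FRRGAESSGD MDVLLTHPSF TSESTKQPKL LHQVVEQLQK VHFITDTLSK GETKFMGVCQ
LPSKNDEKEY PHRRIDIRLI PKDQYYCGVL YFTGSDIFNK NMRAHALEKG FTINEYTIRP
LGVTGVAGEP LPVDSEKDIF DYIQWKYREP KDRSE
"""


def clean_sequence(raw: str) -> str:
    """Remove all whitespace and upper-case the residues."""
    return "".join(raw.split()).upper()


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


sequence = clean_sequence(RAW_SEQUENCE)
fasta = to_fasta("unknown_protein", sequence)
print(len(sequence), "residues")
print(fasta)
```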
  

 

Querying a sequence 

Protein and gene sequence comparisons are done with BLAST (Basic Local Alignment Search Tool). 

To access BLAST, go to Sequence Analysis > Tools > BLAST: 

NCBI homepage showing the BLAST link 

The query is an unknown protein sequence that we want to identify by comparing it to known protein sequences, so select Protein BLAST from the BLAST menu: 

Protein BLAST 

Enter the query sequence in the search box, provide a job title, choose a database to query, and click BLAST: 
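The same submission can be sketched with the BLAST URL API mentioned earlier. The example below only constructs the request and does not send it. The helper name build_blastp_request is hypothetical, the nr database is an assumed choice, and the short query is just the start of the example sequence.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(query: str, database: str = "nr") -> Request:
    """Build (without sending) a request that submits a blastp search."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastp",   # protein query against a protein database
        "DATABASE": database,  # assumed choice; pick the database you need
        "QUERY": query,
    }
    # Supplying a request body makes urllib issue a POST.
    return Request(BLAST_URL, data=urlencode(params).encode("ascii"))


# First 30 residues of the example sequence, used here as a short query.
request = build_blastp_request("MSKRKAPQETLNGGITDMLTELANFEKNVS")
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) should return a request ID (RID) that is then polled with CMD=Get. Consult NCBI's usage guidelines before submitting searches from a script.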

 

 

Viewing your results 

Under the Alignments tab, next to Alignment view, select Pairwise with dots for identities. 

 

  

View the Descriptions tab to see a list of significant alignments. Note that the first match is a synthetic construct (that is, the sequence was artificially made rather than isolated from a naturally occurring organism): 

BLASTp description table 

Key for default display: 

  • Max[imum] Score: the highest alignment score, calculated from the sum of the rewards for matched nucleotides or amino acids and the penalties for mismatches and gaps. 

  • Total Score: the sum of alignment scores of all segments from the same subject sequence. 

  • Query Cover[age]: the percent of the query length that is included in the aligned segments. 

  • E[xpect] Value: the number of alignments expected by chance with the calculated score or better. The expect value is the default sorting metric; for significant alignments the E value should be very close to zero. 

  • Ident[ity]: the highest percent identity for a set of aligned segments to the same subject sequence. 

  • Acc[ession] Len[gth]: the number of nucleotides or amino acids in the result sequence identified by the accession number. 

  • Accession [number]: a unique identifier assigned to records in the NCBI databases. 
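To make these columns concrete, the sketch below computes percent identity, query coverage, and an expect value for toy inputs. It simplifies in several ways: coverage is computed for a single aligned segment, and the expect value uses the textbook relation E = m × n × 2^(−S) with a bit score S and raw lengths, whereas BLAST uses effective lengths and reports its own values. All numbers are invented.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns in which both sequences share a residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / len(aligned_query)


def query_coverage(query_start: int, query_end: int, query_length: int) -> float:
    """Percent of the query covered by one aligned segment (1-based, inclusive)."""
    return 100.0 * (query_end - query_start + 1) / query_length


def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E value from a bit score: E = m * n * 2**(-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Toy values (not real BLAST output).
identity = percent_identity("MSKRKAPQET", "MSKRRAPQET")
coverage = query_coverage(11, 60, 100)
e_value = expect_value(bit_score=10, query_length=100, database_length=1024)
print(identity, coverage, e_value)
```

The relation shows why E values fall so quickly: each additional bit of score halves the number of chance alignments expected.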

Clicking on a protein name displays the pairwise sequence alignment and links to additional information about the protein and its associated gene (if available). 

BLASTp result display pairwise with dots for alignment 

  

In the Pairwise with dots for identities display, residues in the subject sequence that match the query are shown as dots, and any differing amino acid is displayed in red: 

BLASTp result showing misalignment 
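The dots-for-identities rendering is simple to reproduce. The sketch below replaces each subject residue that matches the query with a dot, leaving mismatches and gaps visible; the two aligned strings are toy data, with one substitution introduced by hand.

```python
def dots_for_identities(aligned_query: str, aligned_subject: str) -> str:
    """Render the subject line, replacing residues identical to the query with dots."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "." if s == q and s != "-" else s
        for q, s in zip(aligned_query, aligned_subject)
    )


# Toy alignment (not real BLAST output): one mismatch at position 5.
query_line = "MSKRKAPQET"
subject_line = "MSKRRAPQET"
rendered = dots_for_identities(query_line, subject_line)
print("Query  ", query_line)
print("Sbjct  ", rendered)
```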

 

Saving your results 

To save your search queries and settings, click on the Save Search link, then log in to My NCBI using the Sign in or Register link at the upper right. Once you do this, your search strategies should appear in the Saved Search Strategies tab. 

 

 
