BLAST Searching Tutorial
Learn how to BLAST your sequence against GenBank or custom databases to find similar sequences. This tutorial covers single and batch sequence searches, and options for displaying and exporting results.
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to identify unknown or new sequences, infer functional and evolutionary relationships between sequences, and identify members of gene families.
In Geneious Prime you can search external databases provided by NCBI, or search a custom database created from your own sequences. The result of a BLAST search will be an aligned set of potentially related sequences ranked according to similarity.
More information on BLAST can be found at http://en.wikipedia.org/wiki/BLAST.
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.
Exercise 1: Running a single BLAST search
To run a BLAST search on a single sequence, select P00656 and then click the BLAST button on the toolbar. This will bring up the BLAST dialog.
Geneious Prime automatically determines the sequence type (nucleotide or protein) and shows the appropriate settings for that type. The default database selected is Nucleotide Collection (nr/nt), which contains protein (nr) and nucleotide (nt) accessions. As P00656 is a protein sequence, you can either use tblastn to query the amino acid sequence against the translated nucleotide database, or use blastp to query the protein database. See Exercise 4 for more information on the different BLAST algorithms available.
Select blastp for the Program, and ensure that Results is set to Hit Table and Retrieve is set to Matching Region. These settings will return a list of the top hits and an alignment for each one, plus a query-centric alignment.
Leave the other settings as they are, then click the Search button. Geneious will then send your query to the NCBI and create a new search folder. This will appear as a subfolder of the folder that contains your query sequence.
The folder name shows the sequence used for the query, the database searched, the program used to perform the search, and the number of results returned in brackets.
Exercise 2: Viewing the results
When the search completes, all of the results will be downloaded from NCBI and placed in the newly created folder. By default, the search results should be ordered by their E Value, which is the number of alignments with an equal or better score that would be expected to occur by chance. If your results are not ordered by E Value, click the E Value column header. Your hit table should look something like the table below, but the actual hits may vary slightly as new sequences are added to GenBank all the time.
For E values, the smaller the number the better. They are displayed in exponent notation, so the top hit shown here as 1.18e-107 is the same as 1.18×10^-107. This is a very small number and indicates that it is highly unlikely that an alignment this good would occur by chance. You may even see hits where the E Value reads 0.00e+00, which means the value is too small to be displayed and is effectively zero. Treat these statistics as a guide, as alignments that appear far less significant can still be interesting.
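To make the exponent notation concrete, here is a minimal Python sketch that converts an E value string from the hit table into the "mantissa x 10^exponent" form. The function name `format_e_value` is made up for this illustration and is not part of Geneious.

```python
def format_e_value(e_value: float) -> str:
    """Render an E value as 'mantissa x 10^exponent' (illustrative helper)."""
    if e_value == 0.0:
        # 0.00e+00 in the hit table: too small to display, effectively zero
        return "0 (too small to display)"
    mantissa, exponent = f"{e_value:.2e}".split("e")
    return f"{mantissa} x 10^{int(exponent)}"


top_hit = float("1.18e-107")          # the string shown in the hit table
print(format_e_value(top_hit))        # 1.18 x 10^-107
print(format_e_value(float("0.00e+00")))
print(top_hit < 1e-3)                 # smaller is better: True
```

Python parses the same notation directly with `float()`, so E values copied from the table can be compared numerically.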
In addition to the E Value, there is also a column labelled % Pairwise Identity. This indicates how similar the sequence found in the database is to the one you used as a query. You can see that many of the hits in this example are 100% identical to the query over the length of the alignment, but have different Sequence Lengths. This is because BLAST produces a local alignment: it aligns the longest region of similarity it can find between the two sequences, not the full sequences. The identity refers only to the aligned region, so a very short alignment can have a high identity. This is why alignments tend to be ranked by their E Value rather than by identity. Geneious also produces a Grade score, which combines the query coverage, E value and identity values for each hit with weights 0.5, 0.25 and 0.25 respectively, allowing you to find the longest, highest identity hits.
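The following Python sketch illustrates both ideas: identity computed over the aligned region only, and a weighted score using the 0.5, 0.25 and 0.25 weights mentioned above. It is not Geneious's implementation. In particular, the tutorial does not say how the E value is scaled before weighting, so `e_value_score` is a hypothetical 0 to 100 input, and all example numbers are invented.

```python
def pairwise_identity(aligned_query: str, aligned_hit: str) -> float:
    """Percent identical columns over the aligned region only ('-' is a gap)."""
    if not aligned_query or len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must be non-empty and equal length")
    matches = sum(
        1 for q, h in zip(aligned_query, aligned_hit) if q == h and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def grade(query_coverage: float, e_value_score: float, identity: float) -> float:
    """Weighted combination (all inputs 0-100).

    ASSUMPTION: e_value_score is a hypothetical 0-100 scaling of the E value;
    the real scaling used by Geneious is not described in this tutorial.
    """
    return 0.5 * query_coverage + 0.25 * e_value_score + 0.25 * identity


# A short, perfect local match versus a long, near-perfect one (made-up values)
short_hit = grade(query_coverage=10.0, e_value_score=20.0, identity=100.0)
long_hit = grade(query_coverage=100.0, e_value_score=100.0, identity=95.0)
print(pairwise_identity("MKT-LV", "MKTALV"))  # 5 of 6 columns match
print(short_hit, long_hit)                    # 35.0 98.75
```

Because coverage carries half the weight, the short 100% identity hit scores well below the long 95% identity hit, which is the behaviour the Grade column is meant to capture.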
Now that you have a set of search results, you should look at some alignments. Click on the hit to NP_001014408 and you should see something like this:
You can see from the green identity graph above the alignment that the two sequences are identical. As with any other alignment in Geneious, you can zoom in to display the residues, change the color settings, and highlight agreements or disagreements to the consensus in the General controls to the right of the viewer.
This alignment view only shows the region of alignment between the query and the hit sequence. The blast hit document returned is a summary document and does not contain the full GenBank record for that sequence. To get the full sequence and annotations for the blast hit, click Download Full Sequence(s). Once the full sequence is downloaded you’ll see that a Sequence View tab is added to the viewer. This displays the full, annotated sequence of the BLAST hit, with a new “BLAST Hit” annotation showing which region of the sequence matches the query.
Query-centric view is useful for visualizing all the hits against your query in one window, allowing you to see where conserved regions of your sequence are. Click on the Query Centric View tab at the top of the Hit table, then turn off the annotations in the Annotations and Tracks tab, and in the Display tab choose to highlight Disagreements to Reference. Your display should look something like this:
The query sequence is presented as a reference sequence, with yellow shading, at the top of the alignment. You can see that many of the top hits are extremely similar to the query, indicating that this protein is highly conserved across most of its length. The first 20 residues of the query may be less well conserved, as many of the hits do not span this region. The sequences are ordered by E value, and if you scroll down you’ll see that the sequences become more distantly related to the query as the E value increases.
Exercise 3: Batch searches
So far, you have only performed a single sequence search. However, the Geneious BLAST interface also allows you to perform searches on multiple sequences at once.
To demonstrate batch searching, select the liver ESTs file and click BLAST. The search options should show how many sequences you have selected for the batch search. Because these are nucleotide sequences the default search algorithm chosen is Megablast. This program will find only high similarity matches so may return no hits if your sequences are from a less common species. Change this to blastn for a more sensitive search, and keep the other settings at their defaults. Your search dialog should look like this:
Clicking the Search button now will create a folder for the batch search. Be patient, it may take some time for the results of all searches to be returned. You can do other work with Geneious while the batch search is running.
When the searches complete, the results should look something like this:
As with the single sequence search, each folder contains the results for the query sequence indicated by the name of the folder.
When batch BLASTing large numbers of sequences it can be impractical to have a separate search result folder created for each query. Instead, it may be more useful to only return a single alignment for each query. To do this, select the liver ESTs file again, click BLAST and under Results choose Query-centric alignments only. Change the Maximum Hits to 5 and click Search.
This time only a single results folder is returned containing an alignment for each query sequence.
The Search Hit annotation on each sequence contains the statistics for the blast match, such as E value, pairwise identity, sequence length etc. You can display this information in tabular format in the Annotations table. To do this, select all the alignments and switch to the Annotations tab. The default parameters are not set up for BLAST annotations, so we will change them by clicking the Columns button. In the Columns list, uncheck #Intervals, Direction, Maximum, Minimum, Name, Sequence and Type, and instead check %Pairwise Identity, E value, Hit start, hit end, Organism, Query, Query start, Query end and Sequence Name. You can arrange these columns as you wish by dragging and dropping them, and you should end up with a table that looks something like this:
This table can be exported in .csv format if required.
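Once exported, the table can be processed with any CSV-aware tool. As a minimal sketch, the Python below reads such a file, keeps hits at or below an E value cut-off, and sorts them. The column headers ("Query", "Sequence Name", "% Pairwise Identity", "E Value") and the sample rows are assumptions made for this illustration; check them against the header row of your own export.

```python
import csv
import io

# Made-up stand-in for an exported annotations table (not real results)
SAMPLE_CSV = """Query,Sequence Name,% Pairwise Identity,E Value
est_1,hit_A,98.5,1.2e-50
est_1,hit_B,71.0,0.5
est_2,hit_C,100.0,0.0
"""


def significant_hits(csv_file, max_e_value=1e-3):
    """Return rows with E Value <= max_e_value, best (smallest) first."""
    rows = [
        row for row in csv.DictReader(csv_file)
        if float(row["E Value"]) <= max_e_value
    ]
    return sorted(rows, key=lambda row: float(row["E Value"]))


hits = significant_hits(io.StringIO(SAMPLE_CSV))
for hit in hits:
    print(hit["Query"], hit["Sequence Name"], hit["E Value"])
```

To use a real export, replace `io.StringIO(SAMPLE_CSV)` with `open("your_export.csv", newline="")`, where the file name is whatever you chose when exporting.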
Filtering sequences using BLAST
When batch BLASTing, the Results setting also offers an option to bin the query sequences into “has hit” vs “no hit” in database. This is useful for filtering sequence reads for contamination from a particular non-target species such as human or E. coli. Instead of returning the hit sequences, the search sorts the query sequences into “has hit” and “no hit” sequence lists.
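The binning logic itself is simple, as the Python sketch below shows. It only illustrates the idea; the function name and read names are invented, and Geneious performs this step for you.

```python
def bin_by_hit(query_names, names_with_hits):
    """Split query names into ('has hit', 'no hit') lists, preserving order."""
    with_hits = set(names_with_hits)
    has_hit = [name for name in query_names if name in with_hits]
    no_hit = [name for name in query_names if name not in with_hits]
    return has_hit, no_hit


reads = ["read_1", "read_2", "read_3"]   # hypothetical read names
hit_in_contaminant_db = ["read_2"]       # e.g. matched a human database
has_hit, no_hit = bin_by_hit(reads, hit_in_contaminant_db)
print(has_hit)  # ['read_2']
print(no_hit)   # ['read_1', 'read_3']
```

When screening for contamination, the "no hit" list is the one you would normally carry forward.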
Exercise 4: Types of BLAST
By default, Geneious Prime will offer a BLAST algorithm most appropriate for the type of query sequence and type of database selected. The following algorithms are available:
- Megablast – fast, but only returns highly similar matches.
- Discontiguous megablast – more sensitive, allows more dissimilar matches, and can be set to ignore certain types of bases.
- blastn – slower but most sensitive option, allows more dissimilar matches. Best option for more distantly related query species.
- blastx – translates a nucleotide query to protein in all 6 frames and searches a protein database.
- blastp – compares a protein query to a protein database.
- tblastn – compares protein query to all 6 frames of a translated nucleotide database.
When batch searching, all sequences will be compared using the same program against the same database.
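The list above amounts to a lookup from query type and database type to the applicable programs. The Python sketch below captures that mapping. The type-guessing heuristic is a deliberately crude assumption made for this example and is not how Geneious detects sequence type.

```python
# Programs from the list above, keyed by (query type, database type)
PROGRAMS = {
    ("nucleotide", "nucleotide"): ["Megablast", "Discontiguous megablast", "blastn"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "protein"): ["blastp"],
    ("protein", "nucleotide"): ["tblastn"],
}


def guess_sequence_type(sequence: str) -> str:
    """ASSUMPTION: anything made only of A, C, G, T, U or N is nucleotide."""
    return "nucleotide" if set(sequence.upper()) <= set("ACGTUN") else "protein"


def candidate_programs(sequence: str, database_type: str) -> list:
    """Return the programs that fit this query and database type."""
    return PROGRAMS[(guess_sequence_type(sequence), database_type)]


print(candidate_programs("MKTAYIAKQR", "protein"))     # ['blastp']
print(candidate_programs("MKTAYIAKQR", "nucleotide"))  # ['tblastn']
print(candidate_programs("ACGTTGCAAC", "protein"))     # ['blastx']
```

The heuristic would misclassify a short peptide made only of A, C, G, T and N, which is one reason real tools use more careful detection.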
Depending on the search program selected, you can change the search options by clicking on More Options. This will bring up program specific settings to allow you to further optimize your search. This is the expanded dialog for Megablast searches:
To ensure that the results returned include only significant matches, the E value should be set to 1e-3 or less. Matches with E values higher than this may have occurred by chance. Changing the scoring table, word size and gap costs will also affect the number of hits returned and the aligned regions. For detailed information about these parameters, see the NCBI BLAST Help.
It is possible to restrict the results of a BLAST search to a specific species, or to exclude certain types of sequence, by using an Entrez query in the advanced options. For example, to only return results from mouse, enter Mus musculus[organism] in the Entrez query field. To remove results that come from uncultured, unclassified, or environmental samples, you can use all[filter] NOT uncultured[filter] NOT "environmental"[filter] NOT unclassified[filter].
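If you build such queries often, a small helper can assemble the string for pasting into the Entrez query field. The Python sketch below is a hypothetical convenience function, not part of Geneious or NCBI tooling. Joining the two parts with AND relies on Entrez's standard Boolean operators and is an addition not shown in the examples above.

```python
def entrez_query(organism=None, exclude_filters=()):
    """Build an Entrez query string (illustrative helper, names are made up)."""
    parts = []
    if organism:
        parts.append(f"{organism}[organism]")
    if exclude_filters:
        exclusions = "".join(f" NOT {name}[filter]" for name in exclude_filters)
        parts.append("all[filter]" + exclusions)
    return " AND ".join(parts)


print(entrez_query(organism="Mus musculus"))
# Mus musculus[organism]
print(entrez_query(exclude_filters=["uncultured", '"environmental"', "unclassified"]))
# all[filter] NOT uncultured[filter] NOT "environmental"[filter] NOT unclassified[filter]
print(entrez_query("Mus musculus", ["uncultured"]))
# Mus musculus[organism] AND all[filter] NOT uncultured[filter]
```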
For further examples of Entrez query terms, see here.
You may also be interested in running BLAST searches on your own computer and making your own databases to search against. You can add a local BLAST service from the BLAST dialog by clicking the Add/Remove Databases button and selecting Set Up BLAST Services. This will then guide you through installing BLAST. You will then be able to add searchable databases from downloaded sequence files, or from selected sequences within Geneious. This is a convenient way of maintaining a specialist sequence set to compare sequences against.
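Geneious handles the installation and database creation through its own dialogs, so no command line is needed. For readers curious about what a comparable standalone NCBI BLAST+ workflow looks like outside Geneious, the Python sketch below assembles (but does not run) the two commands involved. The file and database names are hypothetical, and the options Geneious itself passes may differ.

```python
import shlex


def makeblastdb_command(fasta_path: str, db_name: str, protein: bool) -> list:
    """Command to build a searchable BLAST database from a FASTA file."""
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", "prot" if protein else "nucl",
        "-out", db_name,
    ]


def blast_command(program: str, query_path: str, db_name: str,
                  max_e_value: float = 1e-3, max_hits: int = 5) -> list:
    """Command to search the database, with tabular output (-outfmt 6)."""
    return [
        program,
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(max_e_value),
        "-max_target_seqs", str(max_hits),
        "-outfmt", "6",
    ]


# Hypothetical file names, for illustration only
build = makeblastdb_command("my_proteins.fasta", "my_proteins_db", protein=True)
search = blast_command("blastp", "P00656.fasta", "my_proteins_db")
print(shlex.join(build))
print(shlex.join(search))
```

With BLAST+ installed and on your PATH, either list could be passed to `subprocess.run(..., check=True)` to execute the command.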