Sequence Classifier Tutorial

In this tutorial, you will use the Sequence Classifier to classify mitochondrial sequences obtained from subfossil bones. Using the Sequence Classifier and a database of known sequences, you will discover whether these sequences are from kiwi (a flightless bird endemic to New Zealand), and if so, which species and/or subtype of kiwi is the closest match.

Introduction

The Classify Sequences plugin allows you to identify the species of your query sequence by aligning it against all sequences in a specified database.

In this tutorial, we will use this tool to classify mitochondrial sequences obtained from subfossil bones. We wish to know whether these subfossil bones are from kiwi, a flightless bird endemic to New Zealand, and if so which species and/or subtype of kiwi is the closest match.

Background: About the kiwi

Kiwi are flightless birds belonging to the ratites, and are endemic to New Zealand.

Five species of kiwi are currently recognised: Apteryx owenii (Little Spotted kiwi); Apteryx haastii (Great Spotted kiwi); Apteryx mantelli (North Island brown kiwi); Apteryx australis (Tokoeka, a brown kiwi found in the south of the South Island); and Apteryx rowi (Rowi, brown kiwi found only in a single population at Okarito on the west coast of the South Island).

Kiwi were once abundant and widespread across New Zealand, but their numbers declined severely after the arrival of European settlers in the 19th century and the introduction of mammalian predators such as rats, cats, dogs and stoats. Today kiwi populations are fragmented (see map below), and all are threatened with extinction. Subfossil kiwi remains have been found in several locations outside the present-day range of kiwi, and have been used to help clarify species boundaries. Distinguishing among the kiwi species is not always possible from bone morphology alone, so mitochondrial DNA sequences are commonly used to identify bones.

INSTRUCTIONS
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.

EXERCISE 1
Setting up your database

EXERCISE 2
Choosing appropriate parameters

EXERCISE 3
Running the sequence classifier

EXERCISE 4
Interpreting the results

Exercise 1: Setting up your database

The Classify Sequences plugin uses pairwise alignments to compare your query sequence to others in a specified database. Here we will use kiwi mitochondrial DNA sequences available on GenBank as our database to classify our unknown sequences. We will use sequences from two loci: cytochrome b and the control region. The locus must be identified for each sequence, in both the database and the query sequences.

The tutorial folder contains a subfolder called Database sequences. Click on this folder in the Sources panel, and then open the control region sequences folder within it to look at how the database sequences are formatted. These sequences were downloaded from GenBank, and the sequence names have been edited so that they are in the correct format for the database.

You’ll see that each sequence name is in the format sequence name-locus. Each sequence in the database must have a unique name. For these sequences we used Batch Rename to edit the original GenBank files so that the sequence name contains the organism, followed by the specimen voucher ID and/or haplotype name (if this information was on the original GenBank record), as this information may be useful for classifying the sequences beyond species level.

Because we will be using multiple loci to classify our sequences, the locus name (control region) has been appended to the sequence name with a specific delimiter (in this case “-”). If the sequence name up to the delimiter is identical for different loci, then pairwise alignments containing these sequences are concatenated for the overall result. Note that if you are only using a single locus, you do not need to include the locus name in the sequence name.

Now check the sequences in the cytochrome b folder and you’ll see that the sequence names are in the same format, but have “-cytochrome b” appended to the end.
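The naming convention described above can be sketched in a few lines of Python. This is only an illustration of how names might be grouped by specimen, not the plugin's actual implementation. Splitting at the last occurrence of the delimiter is an assumption, and the function names and the "hap1" haplotype label are hypothetical.

```python
from collections import defaultdict

def split_locus(name, delimiter="-"):
    """Split 'specimen-locus' into (specimen, locus).

    Splitting at the LAST delimiter is an assumption made for this sketch.
    """
    specimen, sep, locus = name.rpartition(delimiter)
    if not sep:                 # no delimiter found: single-locus naming
        return name, None
    return specimen, locus

def group_by_specimen(names, delimiter="-"):
    """Map each specimen name to the list of loci available for it."""
    groups = defaultdict(list)
    for name in names:
        specimen, locus = split_locus(name, delimiter)
        groups[specimen].append(locus)
    return dict(groups)

# Hypothetical names written in the tutorial's format.
names = [
    "Apteryx haastii S.25729-control region",
    "Apteryx mantelli hap1-control region",
    "Apteryx mantelli hap1-cytochrome b",
]
groups = group_by_specimen(names)
```

Specimens that end up with two loci in `groups` are the ones whose per-locus alignments could be concatenated; a specimen with a single entry contributes only one locus to the overall result.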

Now go back to the top folder and look at the Unknown sequences list. This list contains 3 query sequences, named Unknown1, Unknown2 and Unknown3. Sequences from cytochrome b and the control region are in separate files, with the locus name appended in the same way as for the database sequences.

Exercise 2: How it works and choosing parameters

How it works

The sequence classifier performs pairwise global alignments with free end gaps between the query sequence and each sequence in the database. When multiple loci are used, pairwise alignments for each gene are performed separately, and these alignments are then concatenated if there are database sequences with the same name for each gene.

The “overlap identity” between your query and the database sequences is then used to determine the likely taxon of your query sequence by picking the database sequence with the highest identity to the query. The overlap identity is the pairwise identity in the region in common between the query and database sequence, where data in end gap regions is ignored. The sequence will only be classified if it meets a minimum overlap identity that the user specifies, and if multiple database sequences have similar overlap identities, then the query sequence will be classified to the taxonomic level that these sequences have in common.
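To make the definition of overlap identity concrete, here is a minimal Python sketch that computes it from two rows of a pairwise alignment. It is not the plugin's code. The function name is hypothetical, and counting internal gap columns as mismatches is an assumption of this sketch.

```python
def overlap_identity(query_aln, db_aln, gap="-"):
    """Return (percent identity, overlap length) for two aligned rows.

    Columns in the end-gap regions of either sequence are ignored.
    Internal gap columns count as mismatches (an assumption here).
    """
    if len(query_aln) != len(db_aln):
        raise ValueError("aligned rows must have equal length")

    def span(row):
        # Indices of the first and last non-gap characters in a row.
        residues = [i for i, ch in enumerate(row) if ch != gap]
        if not residues:
            raise ValueError("row contains no residues")
        return residues[0], residues[-1]

    q_start, q_end = span(query_aln)
    d_start, d_end = span(db_aln)
    start, end = max(q_start, d_start), min(q_end, d_end)
    if start > end:
        return 0.0, 0           # the two sequences do not overlap
    matches = sum(
        1
        for q, d in zip(query_aln[start:end + 1], db_aln[start:end + 1])
        if q == d and q != gap
    )
    length = end - start + 1
    return 100.0 * matches / length, length

# A short query aligned inside a longer database sequence.
identity, length = overlap_identity("--ACGTAC--", "TTACGTTCGG")
```

In this example only the six columns covered by both sequences are scored, and five of them match, so the flanking database bases do not lower the identity.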

It is also possible to set cutoffs for classifying sequences at various taxonomic levels. For instance, if 95% is set as a minimum identity to classify at species level, and the top match to the database has an overlap identity of 94.5%, then the query will only be classified to genus level. Thus, you need to know the approximate levels of sequence identity within and between the different taxonomic levels of your database sequences in order to choose the correct settings for classification.
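The cutoff logic in the example above can be written as a small lookup. This is a sketch under stated assumptions rather than the plugin's implementation: the function name is hypothetical, and treating a cutoff as met when the identity is equal to it is an assumption.

```python
def classification_level(overlap_identity, thresholds):
    """Return the lowest taxonomic level whose cutoff is met, else None.

    thresholds: (level, minimum % identity) pairs ordered from the
    lowest taxonomic level upwards, e.g. species before genus.
    """
    for level, minimum in thresholds:
        if overlap_identity >= minimum:   # ">=" is an assumption
            return level
    return None

# Cutoffs from the example in the text plus a genus-level cutoff.
thresholds = [("species", 95.0), ("genus", 90.0)]
level = classification_level(94.5, thresholds)
```

With a species cutoff of 95%, a top hit at 94.5% falls through to the genus check, which matches the behaviour described in the paragraph above.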

Choosing appropriate parameters

Select the Unknown sequences list and open the Sequence Classifier by going to Tools→Classify Sequences. In the Classification panel you’ll see options for setting the Minimum overlap identity to classify at various taxonomic levels. We will look at a multiple alignment of all the database sequences to decide the appropriate parameters to use here.

Close the Classify Sequences window by clicking the Cancel button, and open the control region alignment document. Switch to the Distances tab and choose “% identity” in the Matrix option to display the % identity between sequences. For control region sequences, within-species identity ranges from about 95-100%, and between-species identity ranges from 90-99%. For the cytochrome b alignment, within-species identity is about 98-100% and between-species identity is 93-99%. Thus, as within-species identity may be as low as 95%, we should set this as the minimum value to classify at species level, and 90% as the minimum value to classify at genus level.
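If you export the % identity values, the within-species and between-species ranges can also be summarised programmatically. The following Python sketch assumes the identities are available as a dictionary keyed by sequence pair. The function name, the sequence labels and the numbers are all hypothetical and are not taken from the tutorial alignment.

```python
def identity_ranges(identities, species_of):
    """Summarise pairwise % identities as within/between-species ranges.

    identities: dict mapping (name_a, name_b) -> % identity
    species_of: dict mapping sequence name -> species label
    """
    within, between = [], []
    for (a, b), identity in identities.items():
        if species_of[a] == species_of[b]:
            within.append(identity)
        else:
            between.append(identity)
    return {
        "within": (min(within), max(within)) if within else None,
        "between": (min(between), max(between)) if between else None,
    }

# Hypothetical values for three sequences from two species.
species_of = {"h1": "A. haastii", "h2": "A. haastii", "m1": "A. mantelli"}
identities = {("h1", "h2"): 98.2, ("h1", "m1"): 92.5, ("h2", "m1"): 93.1}
ranges = identity_ranges(identities, species_of)
```

The lower bound of the "within" range suggests a species-level cutoff, and the lower bound of the "between" range suggests a genus-level cutoff, in the same way as reading the values from the Distances tab.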

Exercise 3: Running the sequence classifier

Select the Unknown sequences list again and open the Sequence Classifier by going to Tools→Classify Sequences. Click the Settings cog at the bottom left of the window and choose Reset to Default if the settings are not already at their defaults.

To set your database folder, click on the Select a folder button and choose the “Database Sequences” folder in the tutorial folder as your database.

The Sensitivity setting specifies the parameters that Geneious uses to align the query and database sequences. With a higher sensitivity setting, the search will run more slowly, but more distantly related queries can be aligned to your database. In this example we are using query sequences from subfossil remains which we suspect are from kiwi, but they may instead be from another bird species, so we will use Highest Sensitivity/Slow, as this will allow more distantly related sequences to align to the database. Keep the Minimum Overlap setting at 50 bp.

Now we will set the parameters for classifying the sequences. For a description of what each of these settings does, please see the Sequence Classifier user manual. Leave Minimum overlap identity to classify and Minimum identity higher than the next best result… at their defaults of 75% and 0.2%, respectively. Under Classify using taxonomy from choose the “Database sequence organism field”, and make the Taxonomic Level Separator a space, as the genus and species names in the organism field are separated by a space.
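The role of the Taxonomic Level Separator can be illustrated with a short Python sketch: each organism name is split into levels, and when several hits are equally good the query is assigned to the levels they share. This only illustrates the idea described in Exercise 2 and is not the plugin's code. The function name is hypothetical.

```python
def common_taxon(organisms, separator=" "):
    """Return the taxonomic levels shared by all organism names.

    Each name is split on the separator, highest level first
    (genus, then species), and the shared leading levels are kept.
    """
    split_names = [name.split(separator) for name in organisms]
    shared = []
    for levels in zip(*split_names):
        if len(set(levels)) != 1:   # names diverge at this level
            break
        shared.append(levels[0])
    return separator.join(shared)

# Two hits from different species of the same genus.
shared = common_taxon(["Apteryx mantelli", "Apteryx australis"])
```

Here `shared` is "Apteryx", so a query whose best hits were split between these two species would be classified to genus level only. If all hits share the same organism name, the full name is returned.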

Now set the minimum identities to classify at each taxonomic level. Remember that in Exercise 2 we looked at a multiple alignment of database sequences to get a feel for what is most appropriate here. For our data, species is the lowest taxonomic level we can classify to when using the Organism field, and we found that within-species identity was 95-100%. Thus, set the Minimum overlap identity to classify at lowest taxonomic level to 95%. In our alignment, between-species (within-genus) identity was sometimes as low as 90%, so set Minimum overlap identity to classify at second lowest taxonomic level to 90%. You can leave the third taxonomic level setting as it is, as we don’t have a third level for these sequences.

Check the Use multiple loci box and check that the delimiter is set as “-” as that is what we have used between the sequence and gene name in our sequences.

For displaying the results, in addition to the default options check Save multiple alignment of all hits per query and Save tree of all hits per query. It is possible to configure both alignment and tree building options here. Click the Alignment button and choose MUSCLE as the alignment program to use, as this will be faster than the Geneious aligner for a large dataset. We will use the default options for the Tree builder, but if you wish you can set options for bootstrapping, outgroups, and tree building method here. Also change the Highlight results in green… setting to 90%, as we want to highlight all results classified to genus level and include these in our alignments.

Your setup window should now look as in the screenshot below. Click OK to run the analysis.

Exercise 4: Interpreting the results

Open the results file produced, and you will see a number of tables. The first table, Summary, lists how many of your query sequences were classified into each species. For details of what each query was classified as, look at the Classifications table. The tables displayed on the right show the overlap identities between query and database sequences for each gene. When multiple loci are used, the first table shows the overall overlap identities when alignments from each gene are concatenated, and the tables underneath show the results for each gene.

Have a look at the Classifications table, and you will see that the “Unknown3” sequence was not classified. Click on this sequence in the Classifications table to bring up the identity tables for that sequence. You will see that all the entries in this table are red, meaning they are less than the specified minimum overlap identity to classify at genus level (i.e. overlap identities are less than 90%). Thus we can be confident that this sample does not come from kiwi.

Now look at the results for Unknown1 by clicking on this sample in the Classifications table. You will see that all of the top matches for this sample are Apteryx haastii (Great Spotted kiwi), with overlap identities >95%, so this sample is classified as Apteryx haastii. The results table shows the name of the sequence with the top match, so we can look at this to see which known specimen or haplotype matched our unknown sample most closely.

We will now look in more detail at the identity tables. First, look at the second and third tables. These show the identities between the query and database sequences for each gene. Remember that the Overlap Identity is the identity in the region of overlap between the query and database sequence. The Query Identity is the identity over the entire length of the query sequence, where regions of the query that do not align to the database sequences (or where database sequences are missing) are counted as mismatches. Thus, a query identity lower than the overlap identity indicates that the query sequence extends beyond the database sequence.
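The difference between the two measures is easiest to see with a small example. The Python sketch below computes a query identity as described above, with every query base that has no matching database base counted as a mismatch. It is an illustration only, with a hypothetical function name, and it assumes the two aligned rows are given as equal-length strings with "-" for gaps.

```python
def query_identity(query_aln, db_aln, gap="-"):
    """Percent of all query residues that match the database sequence.

    Query residues aligned to a gap, or lying beyond the ends of the
    database sequence, count as mismatches.
    """
    if len(query_aln) != len(db_aln):
        raise ValueError("aligned rows must have equal length")
    query_length = sum(1 for q in query_aln if q != gap)
    if query_length == 0:
        raise ValueError("query row contains no residues")
    matches = sum(
        1 for q, d in zip(query_aln, db_aln) if q != gap and q == d
    )
    return 100.0 * matches / query_length

# The query extends four bases beyond the end of the database sequence.
extended = query_identity("ACGTACGT", "ACGT----")
# The query lies entirely inside the database sequence.
contained = query_identity("--ACGT--", "TTACGTGG")
```

In the first case the overlapping region matches perfectly, but only half of the query is covered, so the query identity is 50%. In the second case the whole query is covered and matches, so the query identity is 100%.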

Now look at the top table. This shows the overall results when alignments for each gene are concatenated:

In this table, Overlap Identity is a weighted average over both cytochrome b and the control region, weighted according to the overlap length from each contributing locus. We can see that there are two samples in the database that match the query sequence with 100% identity in the region of overlap. The first of these samples, Apteryx haastii S.25729, only has a match for the control region sequence. This sample does not have a cytochrome b sequence in the database, and this is reflected in the Loci matching column, which has “1 of 2/1”. The “2/1” refers to 2 query sequences that could have matches (as there are both control region and cytochrome b sequences for the query), and 1 database sequence that could potentially have matched (as there is no cytochrome b sequence for that database sample). The Query Identity is low for this match because of the missing cytochrome b sequence. For the other top hits there are both control region and cytochrome b sequences, so the Loci matching column has “2 of 2”.
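The length-weighted average described above can be sketched in Python as follows. The function name and the per-locus numbers are hypothetical and are chosen only to show the arithmetic.

```python
def combined_overlap_identity(loci):
    """Length-weighted mean of per-locus overlap identities.

    loci: list of (percent identity, overlap length in bp) pairs.
    """
    total_length = sum(length for _, length in loci)
    if total_length == 0:
        raise ValueError("no overlapping bases")
    weighted_sum = sum(identity * length for identity, length in loci)
    return weighted_sum / total_length

# Hypothetical per-locus results for one database specimen:
# 100% over 400 bp of one locus and 98% over 600 bp of the other.
combined = combined_overlap_identity([(100.0, 400), (98.0, 600)])
```

Because the second locus contributes more overlapping bases, it pulls the combined value closer to 98% than a simple mean of the two percentages would.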

Now click on the Unknown2 result in the Classifications table. In the results tables you can see that this sample has 100% identity with a number of Apteryx mantelli (North Island brown kiwi) database sequences, so it is classified as this species. However, all the top hits only have cytochrome b sequences in the database.

Alignments and trees

The multiple alignments and trees of query and database sequences are in the “Unknown sequences alignments and trees” folder, which was created when you ran the analysis. You can open the tree from the results table by clicking the Alignment and Tree link next to each results table. Open the tree for the overall result for Unknown2 in a new window so that you can see it clearly. The query sequence (Unknown2) is highlighted in green. Zoom in using the tree viewer controls to the right of the viewer, and you can see that this sequence falls in a clade containing A. mantelli sequences.

This tree can be manipulated in the same way as any tree document in Geneious. If you wish to view the alignment which underlies the tree, click the Alignment tab above the tree viewer.