
Amplicon Metagenomics Tutorial

Learn a strategy for assembling, filtering and analyzing an NGS metagenomic data set in Geneious Prime.

Introduction

This tutorial describes a strategy for assembling, filtering and analyzing a metagenomic data set in Geneious.

Metagenomics is the study of genetic material recovered directly from environmental samples. In this example we will analyse 16S rRNA sequences PCR-amplified from naturally fermented sauerkraut, in order to profile the bacterial community associated with the fermentation process.

This dataset comprises paired-read data from a 16S rRNA amplicon spanning approximately 260 bp of the V4 region. The sequence was generated using an Illumina MiSeq, with 2 x 250 bp read lengths. See SRR7140083 for the original Sequence Read Archive (SRA) submission.

The key to obtaining a reliable classification of your amplicon data is to use a suitably curated database of reference sequences. The approach outlined in this tutorial firstly trims, filters and clusters sequences into OTUs using the de novo assembler. Representative sequences are then BLASTed to the preformatted 16S Microbial database from NCBI, which is a curated set of 16S sequences from bacteria and archaea type strains. In the final step the BLAST results are used as a targeted database for classifying the read set with the Sequence Classifier plugin.

For this tutorial you will need the BBDuk and Sequence Classifier plugins. You will need to install these from the Plugins menu in Geneious if you have not already done so.

Although this tutorial focuses on 16S, this pipeline can be applied to any other metagenomic marker, such as 18S, ITS or COI, provided a suitably curated database for BLAST searching is available.

INSTRUCTIONS
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.

DOWNLOAD TUTORIAL

EXERCISE 1
Step 1. Preprocessing NGS amplicon data

EXERCISE 2
Step 2. Clustering reads into OTUs using the de novo assembler

EXERCISE 3
Step 3. Batch BLAST OTUs and create a taxonomy database

EXERCISE 4
Step 4. Classifying amplicon data with the Sequence Classifier


Click on the file SRR7140083_50000. This contains 50,000 paired 16S amplicon reads, which is a subset of the full SRR7140083 dataset. We are using a subset of the data here so that the analyses can be run quickly, without a large amount of computing power.

Paired Illumina reads are normally provided as separate forward and reverse read lists in fastq format. If these are imported together, Geneious will offer to pair the sequences and create a single paired read list on import. Alternatively they can be paired once they are in Geneious using Sequence → Set Paired Reads. In this example the pairing has already been done, and the pairs are indicated by the paired-read symbols in the sequence viewer.

Trimming

Quality trimming is extremely important for amplicon metagenomic reads so that minor differences in sequences caused by PCR and sequencing error are not mistaken for real variation. We recommend using the BBDuk plugin for trimming NGS data, as it has more features than the inbuilt Geneious trimmer.

To run BBDuk on this dataset, go to Annotate and Predict → Trim with BBDuk. Set up the parameters according to the screenshot below. This will trim any remaining Illumina adaptors, remove bases below an average quality score of 30 from the ends, and remove reads that are shorter than 100 bp after end-trimming.

After the trimming has completed, you should see a second file called SRR7140083_50000 (trimmed) appear in your document table. This file contains 25,374 sequences, so you can see that a significant amount of poor quality data has been removed. If you find that trimming to Q30 removes too much data, we suggest you reduce the trim ends threshold to Q20 but increase the minimum read length to 150 to ensure that clustering and BLAST searching are based on a long portion of the 16S sequence.
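If you prefer to run the same trim outside Geneious, the stand-alone BBDuk command below is a rough equivalent. This is a sketch only: the input/output file names and the adapters.fa reference path are placeholders, and the options Geneious passes internally may differ slightly.

    import subprocess

    # Rough command-line equivalent of the BBDuk trim used above.
    # Assumes bbduk.sh (BBTools) is on your PATH; file names are examples.
    subprocess.run(
        [
            "bbduk.sh",
            "in1=SRR7140083_50000_R1.fastq", "in2=SRR7140083_50000_R2.fastq",
            "out1=trimmed_R1.fastq", "out2=trimmed_R2.fastq",
            "ref=adapters.fa",                        # Illumina adapter reference shipped with BBTools
            "ktrim=r", "k=23", "mink=11", "hdist=1",  # adapter trimming
            "qtrim=rl", "trimq=30",                   # trim both ends to Q30
            "minlen=100",                             # discard reads shorter than 100 bp after trimming
        ],
        check=True,
    )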

Merging paired reads

The region of the 16S rRNA amplified in this example is approximately 250 bp excluding the primer and adaptor sequences. As our reads are also 250 bp, the F and R reads overlap and can be merged to create a single consensus sequence for each pair. To merge reads we will use the BBMerge tool, which is built into Geneious Prime. Select your trimmed read set from the previous step, and go to Sequence → Merge Paired Reads. Use the settings shown below (Merge Rate: High) and click OK.

You will see two new files after merging. Reads that cannot be merged (usually because they are too short after quality trimming) are in the file named SRR7140083_50000 (trimmed) (couldn’t be merged). The merged reads are in SRR7140083_50000 (trimmed) (merged). Click on this file and then go to the Lengths Graph tab above the viewer. You will see that after merging a few sequences are either much shorter or longer than the expected product size of around 250 bp. The longer sequences may either be contamination or incorrectly merged sequences, so we will remove these. We also want to remove the very short sequences as these do not contain enough sequence to be correctly classified. To extract the reads we want to keep, click the Extract button on the Lengths Graph and extract sequences between 150 and 260 bp. This file should now contain 12,465 reads.
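For reference, a roughly equivalent merge-and-filter outside Geneious might look like the sketch below. It assumes bbmerge.sh (BBTools) and Biopython are installed; the file names are placeholders, and the loose=t preset is only an approximation of the High merge rate setting.

    import subprocess
    from Bio import SeqIO

    # Merge overlapping read pairs (bbmerge.sh must be on your PATH).
    subprocess.run(
        [
            "bbmerge.sh",
            "in1=trimmed_R1.fastq", "in2=trimmed_R2.fastq",
            "out=merged.fastq",                                   # successfully merged pairs
            "outu1=unmerged_R1.fastq", "outu2=unmerged_R2.fastq",
            "loose=t",                                            # approximate a higher merge rate
        ],
        check=True,
    )

    # Keep only merged reads between 150 and 260 bp, mirroring the Lengths Graph extraction.
    kept = (r for r in SeqIO.parse("merged.fastq", "fastq") if 150 <= len(r) <= 260)
    SeqIO.write(kept, "merged_150-260.fastq", "fastq")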

At this stage it is good practice to remove chimeric reads (which may be generated during the 16S PCR) from your dataset. The option Remove Chimeric Reads under the Sequence menu in Geneious Prime runs a reference-based implementation of UCHIME. You will need to supply your own database (e.g. RDP-Gold) for this. For a faster analysis, it is also possible to use USEARCH if you provide the executable. As an alternative to the Geneious tools, you may also want to consider using VSEARCH with either de novo or reference-based approaches.
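If you do want to run a reference-based chimera check outside Geneious, a VSEARCH call along the lines of the sketch below would do it. The reads must first be exported in FASTA format, and the reference database file name is a placeholder for whichever curated set (e.g. RDP-Gold) you supply.

    import subprocess

    # Reference-based chimera detection with VSEARCH (installed separately).
    subprocess.run(
        [
            "vsearch",
            "--uchime_ref", "merged_150-260.fasta",  # merged, length-filtered reads (FASTA)
            "--db", "rdp_gold.fasta",                # placeholder for your reference database
            "--nonchimeras", "nonchimeric.fasta",    # reads to keep
            "--chimeras", "chimeric.fasta",          # reads flagged as chimeric
        ],
        check=True,
    )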

In order to save time we will not remove chimeric reads during this tutorial.

Amplicon datasets from NGS sequencing typically contain millions of reads and it is not practical to BLAST each sequence to assign taxonomy. Instead, we define operational taxonomic units (OTUs) by clustering the reads by similarity, and BLAST one representative sequence from each OTU. We can then use these BLAST results as a targeted taxonomic database with the Sequence Classifier plugin, in order to quantify the biodiversity in the full read set (described in Step 4).

In this step we will perform a de novo assembly with customized, high-stringency settings to cluster all closely-related sequences into separate contigs. The consensus sequence from each contig will represent an OTU. This exercise should simplify our dataset significantly and provide a reduced dataset for a subsequent batch-BLAST.
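As a point of comparison only, the same idea – dereplicate, cluster at high identity, and keep one representative per cluster – can be expressed with VSEARCH. The sketch below is not what Geneious runs internally; the FASTA file names and the 97% identity cutoff are assumptions for illustration.

    import subprocess

    # Dereplicate identical reads, recording abundances in the headers.
    subprocess.run(
        ["vsearch", "--derep_fulllength", "merged_150-260.fasta",
         "--output", "dereplicated.fasta", "--sizeout"],
        check=True,
    )

    # Cluster at 97% identity; each centroid acts as an OTU representative.
    subprocess.run(
        ["vsearch", "--cluster_size", "dereplicated.fasta", "--id", "0.97",
         "--centroids", "otus.fasta", "--sizein", "--sizeout"],
        check=True,
    )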

Select the trimmed, merged and length filtered read set you prepared in the previous exercise (SRR7140083_50000 (trimmed) (merged) – length 150-260) and go to Align/Assemble → De novo assembly. Set Sensitivity to “Low sensitivity / Fastest” to load high stringency settings, then change Sensitivity to Custom Sensitivity and adjust the settings to those shown below. This will ensure that each contig comprises only closely related sequences.

Click OK to start the de novo assembler.  The de novo assembly process may take 5-10 min. Once completed, you’ll see a new folder containing the assembly reports under the tutorial folder. Select this folder and then view the assembly report.

With our data set the above settings should return 58 contigs and 86 unused reads. A consensus sequence for each contig is in the Consensus Sequences list, and you should have another list containing the unused reads, representing the unclustered unique sequences. In the next step we will BLAST the sequences from these two lists.

In this exercise we will create a curated database specific to our dataset by blasting the OTUs we generated in the previous step to the preformatted NCBI 16S Microbial database. This will enable us to pull out the most relevant accessions to use for taxonomic classification of the entire dataset.

1. Installation of a local 16S Microbial database

We will BLAST to the 16S Microbial database from NCBI, which is a curated set of 16S rRNA sequences from bacteria and archaea type strains. This is a relatively small database, so it is faster to set up a local copy to BLAST against rather than sending the sequences to NCBI.

To BLAST to a local database you must firstly install the custom BLAST executables if you have not already done so in the past. To do this go to Tools → Add/Remove Databases → Set up BLAST services. Change the Service to Custom BLAST, check Let Geneious do the setup, note the database location and click OK. This will download and install the BLAST program at the location specified. Once the download has finished, navigate to this location on your drive and find the BLAST/data folder. This is where you will put the preformatted files from NCBI.

To get the preformatted BLAST database files, click on this link to go to the BLAST ftp site. The first file in the list should be 16SMicrobial.tar.gz. Click on this file to download it, and then uncompress the file. Once uncompressed, you should have a folder called 16SMicrobial containing files with names like 16SMicrobial.nni, 16SMicrobial.nsi, etc. Move all the files out of the 16SMicrobial folder and into the BLAST/data folder that was created when you set up custom BLAST.
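If you prefer to script the unpacking step, the sketch below extracts the archive and copies the database files into the BLAST data folder. Both paths are examples only – adjust them to wherever your browser saved the archive and wherever Geneious installed custom BLAST.

    import shutil
    import tarfile
    from pathlib import Path

    archive = Path("~/Downloads/16SMicrobial.tar.gz").expanduser()  # example download location
    blast_data = Path("~/Geneious/BLAST/data").expanduser()         # example custom BLAST data folder

    # Unpack the preformatted database next to the archive.
    with tarfile.open(archive) as tar:
        tar.extractall(archive.parent)

    # Copy every 16SMicrobial.* database file into the BLAST data folder.
    for f in archive.parent.rglob("16SMicrobial.*"):
        if f.is_file() and not f.name.endswith(".tar.gz"):
            shutil.copy2(f, blast_data / f.name)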

Once you have added the 16S Microbial files to your BLAST/data folder, restart Geneious Prime so that it picks up the new database. You are now ready to run the BLAST search.

Further instructions for installing custom BLAST and using preformatted BLAST databases are in chapter 15 of the User manual, and this post on the Geneious Knowledge Base.

Note: If you are not working with 16S sequences and do not have a suitable stand-alone BLAST database, it is possible to blast to NCBI’s nt database (with some Entrez filters to exclude environmental, uncultured and unclassified sequences) for this step. Be aware that this will take a lot longer to run than using a local BLAST database targeted to your amplicon sequence.

2. Batch BLAST OTU consensus sequences

Select both the Consensus Sequences and Unused Reads lists then click the Blast button.

Under the Database dropdown menu, you should now see the 16S Microbial database you created – select this as the database. Set the rest of the settings as in the screenshot below. We will only return the top hit (Maximum Hits=1) and retrieve the matching region with annotations, as this will retrieve the taxonomic information from the database as well.

Under the advanced options you can increase the speed of the BLAST search by increasing the number of CPUs. You should set this to one less than the total number of CPUs on your machine (e.g. if your computer has a quad-core processor, set it to 3).

Click Search to run the BLAST. Note that you will get an Ambiguous Query warning due to some consensus sequences containing ambiguities. Just click OK to continue.
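For completeness, the equivalent stand-alone BLAST+ search against the same local database would look roughly like the sketch below. It assumes you have exported the OTU consensus and unused-read sequences to a FASTA file, that blastn (BLAST+) is on your PATH, and that the 16SMicrobial database can be found (you may need to give its full path or set the BLASTDB environment variable).

    import subprocess

    # Top-hit-only search of the OTU representatives against the local 16S Microbial database.
    subprocess.run(
        [
            "blastn",
            "-db", "16SMicrobial",            # local preformatted database set up above
            "-query", "otus.fasta",           # exported OTU consensus + unused-read sequences
            "-max_target_seqs", "1",          # keep only the top hit, as in the Geneious settings
            "-num_threads", "3",              # one less than a quad-core machine, as suggested above
            "-outfmt", "6 qseqid saccver pident length sstart send stitle",
            "-out", "otu_hits.tsv",
        ],
        check=True,
    )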

Once the search has finished, you will see a dialog stating that a small number of sequences had no results. These are most likely either contaminant or incorrectly merged sequences. For each query that does return a result you’ll see one BLAST alignment document in the document table. Select the Alignment View tab to view the alignment between the query and the hit for each document.

3. Creating a sequence classifier database from the BLAST hits

Once the BLAST results are returned we need to do some processing in order to get them in a format where they can be used as a database for the Sequence Classifier. We will perform the following steps:

A. Remove duplicates
B. Download the full hits
C. Extract the BLAST HIT regions
D. Create a database for the Sequence Classifier tool


3A. Removing Duplicates

Due to the high stringency of our de novo assembly, it is likely that some OTUs will have returned the same best BLAST hit, so we should check for and remove duplicate sequences.

Select the folder containing the BLAST hits, then select all the returned BLAST hits and go to Edit → Find Duplicates. Set the options to:

Find: Documents with the same name
Search Scope: Current Folder
What to do: Select most recently modified duplicates

This will select all duplicate files. Hit Delete to remove the duplicates, reducing the number of BLAST hits in the folder from 140 to 47.
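Outside Geneious, the same name-based de-duplication takes only a few lines. The sketch below assumes the hits have been exported to a FASTA file (the file name is an example) and keeps the first record seen for each name.

    from Bio import SeqIO

    seen = set()
    unique = []
    for record in SeqIO.parse("blast_hits.fasta", "fasta"):  # exported BLAST hits (example name)
        if record.id not in seen:                            # keep the first occurrence of each name
            seen.add(record.id)
            unique.append(record)

    SeqIO.write(unique, "blast_hits_unique.fasta", "fasta")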

3B.  Download the Hits

At this stage our BLAST hits are only summaries and we need to download the full sequences. Select all sequences in the Document Table and hit the Download Full Sequence/s button. This may take a minute or so to complete.

3C.  Extract the Hits region

To keep our database small and speed up the classification process, we will now extract only the regions of the BLAST hits relevant to our amplicon.

To do this, select all sequences in the Document Table, click on the Annotations tab, and under Type, select Blast Hit. Click in the table and use control/command-A to select all, then click the Extract button to extract all Blast hit regions to a sequence list file.

The list file created will by default have a name something like “Extraction of 47 annotations”.
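The equivalent operation outside Geneious would be to cut each downloaded hit down to just its aligned region, using the subject coordinates from a tabular BLAST report. The sketch below assumes the outfmt 6 columns used in the earlier blastn example (qseqid, saccver, pident, length, sstart, send, stitle) and a FASTA file of the full-length hits; both file names are examples.

    import csv
    from Bio import SeqIO

    # Full-length hit sequences keyed by accession.version (example file name).
    full = SeqIO.to_dict(SeqIO.parse("hits_full_length.fasta", "fasta"))

    regions = []
    with open("otu_hits.tsv") as tsv:
        for qseqid, acc, pident, length, sstart, send, stitle in csv.reader(tsv, delimiter="\t"):
            lo, hi = sorted((int(sstart), int(send)))
            region = full[acc][lo - 1:hi]        # BLAST coordinates are 1-based and inclusive
            region.id = acc
            region.description = f"hit region {lo}-{hi}"
            regions.append(region)

    SeqIO.write(regions, "hit_regions.fasta", "fasta")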

3D.  Creating the database

This step is very simple: we will place the list file we just created into a new folder to “create” a 16S database.

To do this, select a suitable location in the Sources panel, right-click and choose New Folder.

Give the folder the name SRR7140083 16S database.

Then go back to the location of your new list file and drag it into the new 16S database folder.

Renaming the database entries

In the last step we will rename the extracted BLAST hits and give them the name of their source organism.

To do this, select the database list, go to Edit → Batch Rename, and use the settings shown below.

When you select the first dropdown list for the Replace with setting you may notice that there are two occurrences of Organism.  The first refers to metadata associated with the list file, the second refers to metadata associated with individual sequences within the list. Be sure to select the second occurrence of Organism from the Replace with dropdown list.

Once you have selected the appropriate settings, click OK. You will then be shown a preview window that allows you to confirm that the rename operation will change each sequence name from an accession number to a binomial species name.
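A scripted version of the same rename, for reference: given a two-column tab-separated mapping of accession to organism (which you would prepare yourself, e.g. from the hit annotations), the sketch below rewrites the FASTA headers accordingly. The file names and the mapping file are assumptions.

    import csv
    from Bio import SeqIO

    # accession -> organism mapping, one "accession<TAB>organism name" pair per line.
    with open("accession_to_organism.tsv") as tsv:
        organism = dict(csv.reader(tsv, delimiter="\t"))

    renamed = []
    for record in SeqIO.parse("hit_regions.fasta", "fasta"):
        record.id = organism.get(record.id, record.id).replace(" ", "_")  # fall back to the accession
        record.description = ""
        renamed.append(record)

    SeqIO.write(renamed, "classifier_database.fasta", "fasta")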

We will now use the Geneious Sequence Classifier plugin to analyse our merged Amplicon dataset, using our newly created 16S database. Note that we will be using the original list of merged reads, not the OTUs.

If you do not have the Sequence Classifier plugin, install it now by going to Tools → Plugins, and choosing “Classify Sequences” from the list of plugins to install.

Select the file of processed reads you created in Step 1. These are the reads you trimmed, merged and length filtered, and it should be called SRR7140083_50000 (trimmed) (merged) – length 150 to 260 . Then go to Tools → Classify Sequences.

First, click on the Database Folder and select your newly created SRR7140083 16S database to be the classification database.

Because the 16S amplicon sequences in this dataset come from only the V4 region of 16S, they do not contain enough resolution to classify to species level. We will use the taxonomy field on our database sequences to classify the amplicon reads to genus level based on the % pairwise identity with the database sequences.

To do this, set up the classifier with the settings shown in the screenshot below. First set the Sensitivity to High Sensitivity/Medium and the Minimum Overlap to 100 bp. Under the classification settings, choose Database sequence taxonomy field to classify from, and set the minimum overlap identity for the lowest taxonomic level to 95%, and then the subsequent levels to 90% and 85%. Note that in our example the lowest taxonomic level is genus rather than species – 95% sequence identity is traditionally regarded as an appropriate cutoff for assigning 16S sequences to genera.
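The classification rule being configured here boils down to a set of identity cutoffs applied to the best database match. Purely as an illustration (this is not the plugin's actual implementation), it amounts to something like:

    def assign_level(percent_identity: float) -> str:
        """Map the % identity of the best database match to a taxonomic rank.

        Thresholds mirror the tutorial settings: 95% for the lowest level
        (genus here), then 90% and 85% for the next levels up. Anything
        below the last cutoff is left unclassified.
        """
        if percent_identity >= 95.0:
            return "genus"
        if percent_identity >= 90.0:
            return "family"
        if percent_identity >= 85.0:
            return "order"
        return "unclassified"

    # Example: a read whose best match is 92.3% identical is classified to family level.
    print(assign_level(92.3))  # -> "family"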

Click OK to run the classifier.  Once completed the results will be written to a report document.  

The report document comprises three tables labelled Summary, Classifications and Results.

The Summary table lists how many of your sequences were classified using your database according to the criteria. The number of unclassified sequences is also displayed in the list.

The Classifications table lists all of the sequences submitted for classification and provides details of the match used to make the classification.  If you select any sequence in the Classifications table, then details on individual “hits” to that sequence will be displayed in the Results table.

In this case you can see that this dataset is comprised mainly of Leuconostoc and Lactobacillus species, which are known to be the dominant species in sauerkraut fermentation. A smaller number of reads can only be classified to Family (Leuconostocaceae or Lactobacillaceae) or Order (Lactobacillales) level.

To export the tables for further analysis, select an entry within the Summary, Classifications, or Results tables, then click Export Table. This exports the selected table in .csv format.
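Once exported, these tables are ordinary CSV files and are easy to post-process. For example, a quick tally of the classifications might look like the sketch below; the file and column names are assumptions, so check the header row of your exported file.

    import csv
    from collections import Counter

    counts = Counter()
    with open("classifications.csv", newline="") as f:               # exported Classifications table
        for row in csv.DictReader(f):
            counts[row.get("Classification", "unclassified")] += 1   # column name is an assumption

    for taxon, n in counts.most_common():
        print(f"{taxon}\t{n}")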

For further information on the Classify Sequences tool you can download the manual from the following link: Sequence_Classifier_Manual.pdf.

We also have a separate tutorial specifically on the Sequence Classifier which is available from the following link Sequence Classifier Tutorial.