De Novo Assembly Tutorial
In this tutorial you will perform a de novo assembly of short-read next-generation sequencing data. You will learn how to work with paired-end data, including the pairing, trimming and filtering steps that should normally be undertaken prior to assembly, and how to check the quality of your assembly against a reference sequence, along with some general advice for de novo assembly.
In this exercise we will use the Geneious de novo assembler. Geneious Prime also includes a number of other third-party de novo assemblers, including SPAdes, Tadpole, Velvet and MIRA. See Which de novo assembler is best for my data for more information on the other assemblers.
The example data provided with this tutorial are Illumina reads extracted from Sequence Read Archive (SRA) entry PRJEB10693. These data comprise paired Illumina MiSeq reads with a raw read length of 149 bp and an expected average insert size of 350 bp. The sequences are derived from the genome of Escherichia coli str. K-12 substr. MG1655 (full genome available at NC_000913).
To keep this tutorial to a practical size, a subset of 7800 paired reads is provided. These reads map to a 10,343 bp portion of the reference genome at coordinates 4,190,266-4,200,608. This region contains 12 complete coding sequences (CDS).
This tutorial requires the BBDuk Trimmer and the MAFFT multiple aligner which are both available as plugins. Go menu Tools → Plugins to install the BBDuk and MAFFT plugins.
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.
Overview: Best Practice for preprocessing of NGS reads
Proper preprocessing of your NGS reads will improve assembly accuracy and will usually also significantly reduce the computation and time required to complete assembly.
If you have paired data then your first step should always be to Set paired reads, followed by trimming, and then, if required, by other preprocessing steps as depicted in the following flow diagram.
Importing/Pairing your NGS data
An NGS sequence service provider will normally provide Illumina paired read data as two separate forward and reverse read lists in fastq format. Usually standard Illumina adapters will have been trimmed by the service provider. In most cases the fastq lists will be compressed by gzip (.gz). Geneious can import compressed or uncompressed fastq files.
If you import forward and reverse read files together via menu File → From Multiple files then Geneious will offer to pair the files and create a single paired read list. Similarly, if you drag and drop pairs of read lists into the Geneious window then you will be given the option to pair the reads during the import process.
Geneious will determine the likely read technology, so you only need to set the expected insert size (the expected average insert size excluding adapters) and hit OK.
The output from the Pairing operation will be a single list of interlaced forward and reverse reads.
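Conceptually, interlacing just alternates records from the forward and reverse lists (F1, R1, F2, R2, …). A minimal Python sketch of the idea, using invented read names and sequences:

```python
def fastq_records(text: str):
    """Split FASTQ text into 4-line records (name, sequence, separator, quality)."""
    lines = text.strip().splitlines()
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

def interlace(forward: str, reverse: str) -> str:
    """Interleave forward and reverse records into a single paired list."""
    out = []
    for f, r in zip(fastq_records(forward), fastq_records(reverse)):
        out.extend(f)  # forward mate
        out.extend(r)  # its reverse mate immediately after
    return "\n".join(out)

# Invented toy reads, purely for illustration.
fwd = "@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII"
rev = "@read1/2\nTTAA\n+\nIIII\n@read2/2\nCCAA\n+\nIIII"
print(interlace(fwd, rev))
```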
Manually pairing read lists
If you have already imported your reads as separate lists then you can pair after importing by selecting the lists and going menu Sequence → Set paired reads.
Trimming your reads
It is important to trim reads prior to assembly. Incorrect low-quality calls at sequence ends can prevent proper assembly and increase the computation and time required to perform assembly.
Geneious provides the BBDuk trimmer as a plugin which can be installed via menu Tools → Plugins. BBDuk (Decontamination Using Kmers) is a fast and accurate tool for trimming and filtering NGS reads. The plugin allows you to trim adapters using presets for Illumina adapters, trim ends by quality, trim adapters based on paired read overhangs, and discard short reads (and associated pair mate) that are trimmed to below a minimum length.
The BBDuk trimmer can be accessed via menu Annotate & Predict → Trim using BBDuk.
The BBDuk Minimum Quality: “Q” value is a Phred score (modified Mott algorithm). The following table shows examples of how Q correlates to % Likelihood. Choosing an appropriate Q value will depend on the overall quality of your data. In general, trimming harder (by setting a higher Q value) will improve subsequent assembly speed and quality provided it does not trim a significant proportion of your read data. For Illumina reads we recommend setting a minimum Q value of 20.
|Q value|% Likelihood call will be correct|
|---|---|
|10|90%|
|20|99%|
|30|99.9%|
|40|99.99%|
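These values follow directly from the standard Phred definition, where the probability that a base call is wrong is 10^(-Q/10). A quick Python check of the relationship:

```python
def phred_error_probability(q: float) -> float:
    """Probability a base call is wrong, from its Phred quality score Q."""
    return 10 ** (-q / 10)

def phred_accuracy_percent(q: float) -> float:
    """Percent likelihood the base call is correct."""
    return 100 * (1 - phred_error_probability(q))

for q in (10, 20, 30, 40):
    print(f"Q{q}: {phred_accuracy_percent(q)}% correct")
```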
Other preprocessing steps
See the last section of this tutorial if you want a brief summary of other preprocessing tools available for NGS read data. Otherwise click on the following link to Exercise 1 to move to the next section of this tutorial.
Exercise 1: NGS read preprocessing
Set Paired Reads
This tutorial provides two unpaired read lists SRR513053 subset F and SRR513053 subset R. Select these two lists and go menu Sequence → Set Paired Reads.
Choose Pairs of documents and then choose Forward/Reverse (inward pointing) with an Expected Distance of 350, confirm Read Technology is set to Illumina Paired End then click OK.
This operation will output a new paired list called SRR513053 subset.
Trim with BBDuk
Select the new paired list and go menu Annotate & Predict → Trim using BBDuk. Set to Trim adapters, set to Trim low quality with Minimum Quality 20, and set to Trim Adapters based on paired read overhangs with minimum overlap 20. Finally, check Discard Short reads, with minimum length 20.
The BBDuk trim operation will output a new paired list called SRR513053 subset (trimmed).
If you select the trimmed list and scroll through it you will see that many reads are now trimmed shorter than the original 149 nucleotides. To see the command line output from BBDuk, including a list of how many bases and reads have been removed during trimming, click the Info tab above the viewer.
In Exercise 2 we will assemble the trimmed paired reads using the Geneious de novo assembler.
Exercise 2: De novo assembly of paired-end data
De novo assembly of the full trimmed data set
Select the paired trimmed read list generated in Exercise 1, called SRR513053 subset (trimmed), and in the Toolbar click the Align/Assemble button and choose De novo Assemble. This will open the De novo Assembler Settings window.
The settings window is divided into five sections: Data, Method, Trim, Results and, via the More Options button, an Advanced Settings section. For an overview of the various settings see section 10.3 of the online manual.
The Geneious de novo assembler parses your input data and will select the appropriate Sensitivity: setting to use. In most cases you will not need to adjust the Sensitivity setting. The assembler also estimates and reports the amount of memory expected to be required to perform the assembly. The assembler will warn you if it believes you will not have enough RAM to assemble your selected data set.
The Sensitivity setting adjusts various Advanced options. If you wish to see how these Advanced settings change depending on sensitivity, click on the More Options button in the bottom left corner of the Settings window, change the Sensitivity setting and observe how the Advanced options change. Hover over each Advanced setting to see a Tooltip describing the setting. If you wish to modify specific Advanced settings then set Sensitivity to Custom Sensitivity.
Running de novo assembly
In this exercise we will run the Geneious de novo assembly algorithm with the default settings. To ensure you are using default settings, click on the cog in the bottom left corner of the window and choose Reset to defaults.
In the Results section of the window, check the options to Save an Assembly report, Save contigs and Save consensus sequences.
The Trim section of the settings window is primarily for Sanger reads. Ensure it is set to Do not trim.
Click OK to start the assembly, noting the time it takes for assembly to complete. We will compare this time with that taken after normalization.
Viewing the assembly results
Upon completion of assembly three new files will be written: an assembly report, an assembly file, and a consensus sequence generated from the assembly.
Select the Assembly report to view it. In this exercise all reads should assemble as a single contig, so the report will be simple. For more complex assemblies resulting in multiple contigs, various statistics including N50 will be reported.
The Assembly report provides a Show options link that opens the Assembly settings window to show you the Assembler options used during assembly.
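N50 is the contig length at which, walking down the contigs from longest to shortest, at least half of the total assembled bases have been accumulated. A minimal Python illustration with invented contig lengths:

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(contig_lengths)
    accumulated = 0
    for length in sorted(contig_lengths, reverse=True):
        accumulated += length
        if accumulated * 2 >= total:
            return length
    return 0

# Invented contig lengths: total 1000 bp; 400 + 300 = 700 >= 500, so N50 = 300.
print(n50([100, 200, 300, 400]))  # → 300
```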
Select the contig file to view it. By default, assembled paired reads are colored according to their assembled paired distance. The color depends on how the paired distance differs from the Expected distance that was set when you paired the reads. Select the Home tab in the side panel, then click on the Options link to see the color scheme.
Use the zoom controls to zoom in and see the sequences at the nucleotide level.
Click on the Statistics tab in the viewer panel to see the Mean Coverage for the assembly file.
Click on the Graphs tab in the viewer panel to see the settings that can be adjusted for the Blue coverage graph. These settings allow you to identify areas of high/low or single stranded coverage.
Click on the Insert Sizes tab at the top of the viewer panel to see a histogram showing the distribution of calculated insert sizes based on the assembly of the paired reads. In this example we can see that the mean paired distance is very close to the predicted expected insert size of 350 bp.
De novo assembly of a normalized data set
De novo assembly of large data sets requires significant RAM and computational time. We will now “normalize” our trimmed paired read list and assemble again.
Select the SRR513053 subset (trimmed) list and go menu Sequence → Error Correct and Normalize reads. Use the default settings, but uncheck the option for Error correction. Click OK to run. When complete, the tool will output a new normalized list called SRR513053 subset (trimmed) (normalized).
Select the normalized list; it should contain 3776 reads, close to a 50% reduction compared to the original list of 7800 reads.
Now select the normalized list and repeat the de novo assembly, again noting the time it takes for assembly to complete. You should find assembly of the normalized data set takes around one third of the time taken for the full 7800-read data set. If you select the “normalized” contig and view the Statistics tab you will see this contig has an average coverage of 52.3.
Select the two consensus sequences generated for the assemblies (named SRR513053 subset (trimmed) Assembly 2 and SRR513053 subset (trimmed) (normalized) Assembly 2), and also the reference sequence, E. coli K12 MG1655 (NC_000913) ref extraction, provided with this tutorial.
Click the Align/Assemble button in the Toolbar, select Multiple Align and choose to align with the MAFFT alignment tool. Check the option to Automatically determine sequences’ direction. Click OK to align.
Select the output Nucleotide alignment to view. You should see that the Identity graph at the top of the alignment is a solid green bar, indicating that all three sequences are identical at all positions in the alignment.
This last exercise shows that normalization can significantly reduce the time required for assembly with no loss in accuracy of the assembly. For more information on using normalization in combination with de novo assembly see the Summary section of this tutorial.
This exercise has used the Geneious de novo assembly algorithm. A number of other algorithms are available in Geneious: several are bundled with Geneious, and others are available as plugins. See the following Knowledge Base article for a brief summary of the various assemblers.
General advice for de novo assembly
De novo assembly is one of the most RAM intensive and CPU intensive tasks that can be undertaken in Geneious Prime. See our Knowledge base post for an overview of hardware requirements for de novo assembly using the Geneious de novo assembler.
To improve quality of assembly and reduce the time and RAM required for assembly you should always:
1. Trim ends using BBDuk. Trim stringently with a “Q” value of 20 or greater.
2. Aim for assembly coverage of 50-100x. Higher coverage will not usually improve final results and will increase RAM requirements and time for assembly. See the following Link for information on how to calculate expected coverage based on your expected genome size, average read length (trimmed), and number of reads.
To de novo assemble a subset of your reads use one of the following options:
a) In the De novo assembly settings window check the option to Use X% of data and set an appropriate value based on coverage calculations. This will always use the first X% of reads in your list.
b) Use Workflows → Randomly sample sequences to create a randomly sampled subset (in pairs) of your read list.
c) Use Normalization to reduce the size of your data set as described below.
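Expected coverage is simply total sequenced bases divided by target size: (number of reads × average read length) / genome size. A quick check using this tutorial's subset (7800 reads of up to 149 bp over a 10,343 bp region; the untrimmed read length is used here, so this is an upper-bound estimate):

```python
def expected_coverage(num_reads: int, avg_read_length: float, genome_size: int) -> float:
    """Expected average coverage = total sequenced bases / genome (or region) size."""
    return num_reads * avg_read_length / genome_size

# This tutorial's data: 7800 reads, 149 bp raw length, 10,343 bp target region.
print(round(expected_coverage(7800, 149, 10343), 1))  # → 112.4
```

This raw-data estimate of roughly 112x is about double the 52.3x seen for the normalized assembly in Exercise 2.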
Will de novo assembly give me a complete contiguous genome?
All de novo assembly algorithms, regardless of the technique used, are unable to unambiguously assemble across perfect repeats when the repeat unit is longer than the read length or the paired read insert size. In practice this means that assembly of genomic Illumina short read data will in most cases result in the creation of multiple nonoverlapping contigs.
For example, microbial genomes usually contain multiple copies of the rRNA gene cluster which each contain close to perfect repeats of the SSU (16S) LSU (23S) and 5S rRNA genes. These identical rRNA clusters will prevent assembly and recovery of a single contiguous consensus sequence. In most cases other repeat units (duplicated genes, transposons etc) will also generate further “breaks” in the de novo assembly.
All reads derived from the repeats will end up assembled together at the end of a single contig. As a consequence, coverage of the assembled repeat region will be higher than the average coverage for the unique portion of the genome.
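Because reads from every copy of a repeat pile onto the one assembled copy, a collapsed repeat's coverage scales with its copy number, so the coverage ratio gives a rough copy-number estimate. A sketch with purely illustrative numbers:

```python
def estimated_copy_number(repeat_region_coverage: float, genome_mean_coverage: float) -> float:
    """Rough copy-number estimate for a collapsed repeat: its coverage
    divided by the mean coverage of the unique portion of the genome."""
    return repeat_region_coverage / genome_mean_coverage

# Illustrative: a repeat region at 350x in a genome averaging 50x
# suggests roughly 7 collapsed copies.
print(round(estimated_copy_number(350.0, 50.0)))  # → 7
```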
Other preprocessing tools
Error correct and Normalize reads
Accessed via menu Sequence → Error correct and normalize reads.
The Error correct and Normalize reads tool utilizes BBNorm, which is designed to normalize assembly coverage by down-sampling reads in high-depth areas of a genome, resulting in a more even coverage distribution. Importantly, normalization will not remove reads in lower coverage areas.
Normalization can substantially reduce data set sizes, and subsequently, for de novo assembly, it will reduce assembly time and reduce RAM requirements.
Note that normalization can potentially “amplify” errors in “difficult to sequence” areas to the point where the errors appear significant. Therefore, if you use normalization then we recommend using the strategy depicted in the flow diagram shown below. This combined normalization/de novo assembly/map to reference strategy will usually still be far faster than attempting de novo assembly of your full data set.
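BBNorm itself estimates per-read depth from kmer counts; the following is only a conceptual sketch of the down-sampling idea, keeping each read with probability target/depth so that high-depth regions are thinned while low-depth regions are untouched. All read IDs and depth values are invented for illustration:

```python
import random

def normalize(read_depths, target_depth, seed=0):
    """Conceptual sketch of coverage normalization: keep each read with
    probability min(1, target/depth). Reads in regions at or below the
    target depth are always kept; deeper regions are down-sampled.
    (BBNorm's real algorithm works from kmer counts; this is illustrative.)"""
    rng = random.Random(seed)
    kept = []
    for read_id, depth in read_depths:
        p_keep = min(1.0, target_depth / depth)
        if rng.random() < p_keep:
            kept.append(read_id)
    return kept

# Invented depths: reads 0-2 sit in a 40x region (all kept),
# reads 3-9 in a 400x region (kept with probability 0.25).
depths = [(i, 40) for i in range(3)] + [(i, 400) for i in range(3, 10)]
kept = normalize(depths, target_depth=100)
```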
Merge paired Reads
Accessed via menu Sequence → Merge paired reads
This tool utilizes BBMerge and is designed to merge two overlapping paired reads into a single read. This tool is useful for generating a consensus from overlapping reads generated by amplicon sequencing.
Remove duplicate Reads
Accessed via menu Sequence → Remove duplicate reads
This tool utilizes Dedupe and is designed to find and remove all contained and overlapping sequences in a dataset.
Remove chimeric Reads
Accessed via menu Sequence → Remove chimeric reads.
This tool will filter chimeric reads from sequencing data by comparing to a reference database. You can choose between the bundled public domain UCHIME algorithm or download and use the faster USEARCH 8. Note that the free version of USEARCH 8 is limited to using 4 GB of RAM and so cannot handle larger NGS datasets.
Separate by barcodes
Accessed via menu Sequence → Separate by barcodes.
This tool will demultiplex custom barcoded data into separate lists. The tool has 454 MID barcode presets, or you can define and use your own custom barcode sets.
Note: Demultiplexing should always be performed before trimming using BBDuk.