Geneious Academy logo

Mapping and SNP Calling Tutorial

Learn how to perform a reference assembly with next-generation sequencing (NGS) data, and to call SNPs on the assembled contig. You will learn the typical steps of an NGS workflow such as quality trimming, read pairing, and generating a consensus sequence.

Introduction

In this tutorial you will learn how to perform a reference assembly with next-generation sequencing (NGS) data, and to call SNPs on the assembled contig. You will learn the typical steps of an NGS workflow such as quality trimming, read pairing, and generating a consensus sequence.

This tutorial requires the BBDuk plugin. To install this plugin, go to Tools → Plugins find BBDuk Trimmer in the list of available plugins, and click Install.

INSTRUCTIONS
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.

DOWNLOAD TUTORIAL

OVERVIEW
NGS preprocessing best practice

EXERCISE 1
NGS read preprocessing

EXERCISE 2
Map to Reference

EXERCISE 3
Exploring contig documents

EXERCISE 4
Calling SNPs

EXERCISE 5
Comparing SNPs

Overview: Best Practice for preprocessing of NGS reads

Proper preprocessing of your NGS reads will improve mapping accuracy and will also usually significantly reduce the possibility of false positive SNP calls.

Your first step should always be to Set Paired reads, followed by trimming, then if required, other preprocessing steps.

Importing/Pairing your NGS data

An NGS sequence service provider will normally provide Illumina paired read data as two separate forward and reverse read lists in fastq format.  In most cases the fastq lists will be compressed by gzip (.gz).  Geneious can import compressed or uncompressed fastq files.

If you import forward and reverse read files together via menu File → From Multiple files then Geneious will offer to pair the files and create a single paired read list. Similarly, if you drag and drop pairs of read lists into the Geneious window then you will be given the option to pair the reads during the import process.

Geneious will determine the likely read technology, so you only need to set the expected insert size (the expected average insert size excluding adapters) and hit OK.

The output from the Pairing operation will be a single list of interlaced forward and reverse reads.

Manually pairing read lists

If you have already imported your reads lists as separate lists then you can pair after importing by selecting the lists and going menu Sequence → Set paired reads.

NGS Trimming

Accessed via menu Annotate & Predict → Trim using BBDuk.

It is important to trim reads prior to assembly.  Low quality calls at sequence ends will potentially prevent proper assembly and increase the computation and time required to perform assembly.

Geneious provides the BBDuk trimmer as a plugin which can be installed via menu Tools → Plugins.  BBDuk (Decontamination Using Kmers) is a fast and accurate tool for trimming and filtering NGS sequencing data. The plugin allows you to trim adapters using presets for Illumina adapters, trim ends by quality, trim adapters based on paired read overhangs, and finally discard short reads (and their pair mate) that are trimmed to below a minimum length.

The Quality (Q) value is a Phred score. The following table shows examples of how Q correlates to % Likelihood.  Choosing an appropriate Q value will depend on the overall quality of your data.  Generally trimming harder (by setting a higher Q value) will improve subsequent assembly provided it does not remove a significant proportion of your data.  For illumina data we recommend a Q value of 20.

Q value % Likelihood call will be correct
6 75
10 90
13 95
20 99
30 99.9

Exercise 1: NGS read Preprocessing

In this tutorial, we will work with a dataset of Illumina sequence reads which map to a single gene in the Escherichia coli genome. In this exercise we will prepare the data for mapping by trimming the poor quality bases.

Read-pairing

Most next-generation sequencing platforms such as Illumina, Solid, Ion torrent and 454 provide the option of paired-end sequencing, those reads need to be paired (see Overview). In this tutorial, this step has already been done and you are provided with one file of paired reads, with an insert size of 500 bp. Select this file, and you’ll see that each sequence is tagged with a forward or reverse tag.

Trim with BBDuk

Select the paired list, go to menu Annotate & Predict → Trim using BBDuk. Set to Trim adapters, set to Trim low quality, set Minimum Quality to 20, set to Trim Adapters based on paired read overhangs, with minimum overlap of 20. Finally check Discard Short Reads, with minimum length 20.

The BBDuk trim operation will output a new paired list called yghJ paired Illumina reads (trimmed).

Exercise 2: Map to Reference

Holding down the shift key, select your file of trimmed reads, and the reference sequence (yghJ CDS). Click Align/Assemble → Map to Reference, and reset to the default setting using the settings cog at the bottom left of the window.

Geneious Prime will guess which of the sequences you have selected is the reference, but you can change this using the drop-down menu. In this case it has chosen correctly, and “yghJ CDS (divergent reference)” should be shown as the reference sequence.

Note that in Geneious 8.1 and above, the reference sequence does not have to be pre-selected before opening the Map to Reference window. Instead, it can be selected from any folder in your database by clicking the Choose button within the setup options.

In the Method panel, ensure Geneious is chosen as the Mapper. A number of different mapping algorithms are available here, and these have varying strengths and weaknesses depending on the type of data you are assembling. A brief outline of the different algorithms available is given here.

The Sensitivity should be set on Medium Sensitivity/Fast. Geneious Prime will automatically choose an appropriate sensitivity based on the size of your dataset. For next-generation sequencing, Medium or Medium-Low sensitivity is recommended, as using High Sensitivity will take a long time and is unlikely to improve the results if you have sufficient coverage. The Fine Tuning option can improve the results by aligning reads to each other in addition to the reference sequence – set this on “Iterate up to 5 times”.

Ensure Do not trim is selected, and under the Results panel, choose Save assembly report and Save contigs. Uncheck “Save in subfolder” if it is selected. Your Map to Reference options should now look like this:

Click OK to run the assembly – it may take a few minutes to complete.

Two new documents should now be created in your document table – a contig of the reads mapped to the reference, and an assembly report. Open the assembly report and you will see how many reads were assembled, how long it took, and what contigs were produced. We will explore the contig document further in the next exercise.

Exercise 3: Exploring contig documents

Open the contig document (which should be called “yghJ paired Illumina reads (trimmed) assembled to yghJ CDS (divergent reference)”) to see how the reads map to the reference sequence. Under the Advanced settings tab to the right of the sequence viewer, ensure that “Vertically compress contig” is checked. This will display the reads in horizontal rows. Now go back to the General tab and ensure that Colors is set on “Paired Distance”. This setting allows you to see at a glance whether your paired-end reads map at the expected distance apart, based on the insert size you specified when you set up the paired reads. In this contig you can see that most of the reads are green, meaning that they map with approximately the expected insert size. Click on the Insert sizes tab above the sequence viewer to see the actual distribution of insert sizes. You can see that most pairs map with inserts of 450-500 bp, close to the expected size of 500 bp.

Switch back to the Contig View. At the top of the contig you’ll see a consensus sequence. Zoom in so that you can see the sequence bases. This is the consensus of the reads only and does not include the reference sequence. The settings used to call the consensus are set under the Display tab to the right of the sequence viewer. Because these reads have quality scores attached, Highest Quality should be chosen as the Threshold for calling the consensus sequence. This setting calculates a majority consensus that takes the into account the relative quality scores for each base at that position (see the Geneious user manual for more information).

Now change the Threshold to “100% – Identical”. You should see a number of ambiguous bases appear in the sequence. Under this setting an ambiguous base will be inserted anywhere there is a mixture of bases among the reads, even if only one read out of a thousand contains a different base. This setting should not be used for mapping NGS data, as it will introduce ambiguities in the consensus sequence that are the result of read errors, rather than real polymorphisms. If you do not have quality scores on your reads, then a Threshold of 90-99% is most appropriate for ensuring only real polymorphisms appear as ambiguous bases in your consensus sequence. Change the setting to 95% to see how it affects the ambiguities, then change it back to “Highest Quality” for the remainder of the tutorial.

Underneath the consensus sequence you should be able to see a blue coverage graph as in the screenshot below. If you can’t see this, click on the graphs tab to the right of the sequence viewer and enable “Show graphs” and “Coverage”. The coverage graph shows how many reads map at each base and can be useful for assessing the quality of your mapping.

There are two tools available in Geneious Prime that allow you to quickly identify regions of high or low coverage:

1. Under the Graphs tab you can highlight regions above or below a certain coverage. Check “Highlight above” and set this on 50. You should now see a yellow bar under the coverage graph covering regions where the coverage is greater than 50.

2. You can annotate regions of high or low coverage by going to Annotate and Predict → Find Low/High Coverage.

We will use the second option to annotate regions of low coverage so that we can exclude these regions when we call SNPs in the next exercise. Check Find regions with coverage below and choose Standard deviations from the mean = 2. Check both Merge regions options and uncheck the High Coverage options. Click OK. You should now see a Coverage annotation track on the reference sequence, which has annotations in regions of low coverage. Click Save and choose “Yes” when asked if you want to apply changes to the original sequence.

Exercise 4: Calling SNPs

We will now use the Geneious variant finder to find SNPs in our mapped data. Select the contig document and go to Annotate and Predict → Find Variations/SNPs. Reset to the default settings using the Settings cog at the bottom left.

The options in the top (“Find Polymorphisms”) panel allow you to set the parameters for when SNPs are called, so that disagreements that result from sequencing errors are filtered out. We will use the default settings as they are normally appropriate for identifying real SNPs – if you want more information on these settings click the “?” button or mouse over the option.

Ensure that the option to Analyze effect of polymorphisms on translations is checked, and change the Default Genetic Code to “Bacterial”. This uses the CDS annotation on our reference sequence to determine the coding sequence of our mapped reads, and calculates whether observed SNPs will result in a change in the amino acid sequence.

Leave the other options as they are and click OK

You should now see an annotation track called “Variants: yghJ paired Illumina reads” added to the reference sequence. Click Save and choose “Yes” when asked if you want to apply changes to the original sequence. This will load your SNP track onto the original reference document.

Scroll along the contig document to a position containing a SNP annotation (denoted by a vertical yellow bar). Mouse over the annotation and you’ll see a popup window containing information about that SNP. This includes the base change, variant frequency, SNP type, and information about the protein and CDS changes.

To display this information in a table, click the Annotations tab above the sequence viewer. This will bring up a table of all the annotations on your sequence. Click Type and choose “Polymorphism” to display only polymorphism annotations. The table should automatically be showing the relevant columns, such as Polymorphism Type, Variant Frequency, Amino acid/Codon/base change etc. To bring up additional columns or remove existing columns click the Columns button and add/remove the columns you want.

Once you have the table looking the way you want it to, you can export it to a spread sheet by clicking Export table. This will export your table in comma-separated (.csv) format.

Exercise 5: Comparing SNPs

Geneious Prime has a Compare Annotations function which allows you to filter SNPs based on their overlap with another annotation track or annotation type. We will use this function to filter out SNPs that are in regions of low coverage*.

In Exercise 3 we created an annotation track called “Coverage” to identify regions where coverage falls to below two standard deviations from the mean coverage. Compare annotations will allow us to exclude annotations from the Variants track that fall within the low coverage annotations.

Select your reference sequence. This should now have the tracks you created in the earlier exercises displayed on it. Go to Annotate & Predict → Compare Annotations. You can specify the annotations you wish to compare in the Annotation Types panel. For Set A select “Polymorphism” “on track Variants: yghJ paired Illumina reads”, and for Set B select “Low” “on track Coverage: yghJ paired Illumina reads”. Under Comparison, uncheck Names must match, as the polymorphism and coverage annotations have different names. Check Allow intervals to partially match…, as this will return polymorphisms such as indels that are only partly within a low coverage annotation.

Under results check A-B. This will return polymorphism annotations that do not overlap with coverage annotations – the venn diagram to the right, and the “Example” panel displays this graphically. The window should now look as in the figure below.

Click OK, and you should now see a third annotation track on your reference sequence called “Variants: yghJ paired Illumina reads – Coverage: yghJ paired Illumina reads”. This track has fewer polymorphism annotations than the original Variants track. Scroll along to regions of the sequence marked as low coverage and you weill see that there are a number of polymorphism annotations in the original Variants have been excluded from the new track.

Click Save to record the new track on your reference sequence.

*Note: For demonstration purposes in the tutorial, we’ve filtered out low coverage SNPs using Compare Annotations. However, you can set up the SNP finder to automatically filter out low coverage SNPs without needing to perform this step. To do this, check the Minimum Coverage box in Find Variations/SNPs and enter the minimum coverage required for a SNP to be called.

Mapping algorithms available in Geneious Prime

There are a number of mappers available in Geneious Prime. Some mappers are not bundled with Geneious Prime but may be installed as optional plugins from Tools -> Plugins. The best mapper to use may depend on your data. Below is a brief overview of the advantages and disadvantages of some mappers.

Geneious

Advantages:

  • Fast
  • High sensitivity
  • Iterative mode to extend past ends of reference sequence and map ends of reads correctly around indels
  • Can discover structural variants
  • Supports circular reference sequences (maps correctly around the origin)
  • Supports soft trimmed reads
  • Can map existing alignments and de novo assembled contigs to reference sequences
  • Provides progress during mapping

Geneious for RNA Seq

Use this when mapping RNA sequence reads to a genome with introns.

Advantages:

  • Can map reads that span existing annotated introns
  • Can discover novel introns and map ends of reads correctly around these novel introns
  • Can discover fusion genes
  • Provides progress during mapping

Disadvantages:

  • Novel intron and fusion gene discovery is a little slow

BBMap

Advantages:

  • Fast
  • High sensitivity

Bowtie2

Advantages:

  • Low memory usage
  • Widely used

Tophat RNA Seq

Advantages:

  • Discovers novel introns
  • Widely used

Disadvantages:

  • Not available on Windows