Expression Analysis with DESeq2 Tutorial
Learn how to use DESeq2 in Geneious Prime to compare expression levels for two sample conditions with replicates.
In this tutorial you will learn to compare expression levels using RNA-seq data from 2 sample conditions, where each sample has 3 replicates.
The basic Geneious Expression Analysis tutorial covered the use of Geneious Prime’s built-in RNAseq expression analysis tools. This tutorial uses Geneious Prime’s implementation of the R package DESeq2, which should be used when you have multiple replicates for each sample condition.
About the data
This tutorial uses a sample dataset from Vibrio fischeri, a marine bioluminescent bacterium which is the monospecific symbiont of the Hawaiian bobtail squid, Euprymna scolopes. These bacteria colonize the light-emitting organ of E. scolopes, providing a light source for the squid’s nocturnal antipredatory behavior.
The RNAseq datasets are from a study by Luke Thompson et al, investigating transcriptional changes that the bacteria undergo in the early stages of colonizing the squid host. RNAseq experiments were performed on V. fischeri cells in culture, cells that were incubated in artificial seawater, and cells collected during expulsion from the squid host in order to look for changes in gene expression that were directly associated with host colonization.
The full datasets from this study can be downloaded from NCBI, under BioProject PRJNA319479.
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.
Exercise 1: Data preparation and mapping
You have been provided with 6 sets of reads, representing two different sample conditions. Three are from bacteria incubated in seawater, simulating planktonic conditions (Plk), and three are from bacteria collected immediately after venting from squid (Vnt). All datasets are 100bp single end reads, produced from an Illumina HiSeq2500. The reads have been quality trimmed with the BBduk plugin to remove low quality bases.
To assist with assigning sample conditions in the DESeq2 analysis, the sample condition has been added as a metadata field on each sequence list. To view this, select one of the sequence lists and click the Info button above the sequence viewer. You will see a “Sample condition” metadata field that is assigned as either “planktonic” for Plk sequence lists, or “Squid-associated” for Vnt lists.
Now we will map the reads to the reference sequence. Select all 6 the read lists (hold down shift to bulk-select), and go to Align/Assemble → Map to Reference. Set the reference sequence to the provided file NC_006841 using the Choose button. Because we want to create separate assemblies for each read set, check the option to Assemble each sequence list separately. Under Results choose to Save contigs, Save in sub-folder, and Save assembly report. All the other settings can be left as the defaults.
Note: if you are working with a genome containing introns, use the Geneious RNA mapper or Tophat plugin to map your reads instead of the standard Geneious mapper
Your settings should now look as in the screenshot below. Click OK to start the mapping.
Once the mapping has finished (it may take some time), you will see 6 assembly documents in a new subfolder.
Exercise 2: Calculating Expression Levels
In order to compare expression levels between the 2 sample conditions, you must first calculate the expression levels for each replicate. To do this, select all the assemblies produced in Exercise 1, and go to Annotate and Predict → Calculate Expression Levels. Click Continue if you get a warning that “Changes can’t be undone”. Use the default settings for this analysis (Ambiguously Mapped Reads: Count as partial matches, and Annotation Type: CDS) and click OK. When the analysis has finished it will automatically save, and you will get a prompt asking if you want to apply changes to the original sequences. Click Yes, then OK and then you will see an “Expression” track on each assembly.
This analysis calculates raw read, fragment and transcript counts, as well as normalized measures RPKM, FPKM and TPM. To see these values, you can either mouse over individual annotations in the Expression track, or switch to the Annotations tab above the sequence viewer to display values from the annotation track in tabular format.
Now switch back to the reference sequence NC_006841. Because you chose to apply the changes to the original sequence, you should be able to see 6 “Expression” annotation tracks on this sequence, one for each sample condition replicate. If you can’t see them, check that they are enabled in the Annotations and Tracks tab to the right of the sequence viewer.
Exercise 3: Comparing expression levels with DESeq2
We will now run DESeq2 to calculate differential expression between the two sample conditions. With the reference sequence selected, go to Annotate and Predict → Compare Expression Levels.
Choose DESeq2 as the Method.
Because we previously added a meta-data field for sample condition on our reads, Geneious has detected this and offered to automatically assign conditions using this field. You should see that all the “Plk” tracks are assigned to condition A, and all the “Vnt” tracks are assigned to condition B. If you do not have a field on your sequences for assigning sample conditions, you can set this manually.
Note that to assign conditions automatically by field, any field can be used as long as it contains only 2 different values. Geneious will detect valid fields and display them as options for automatic assignment.
The DESeq2 options should look like this:
Click OK to run the analysis.
The first time you run DESeq2, Geneious will download and install R and all the required packages. This will add a few extra minutes onto the analysis time. After the analysis is finished, you will see an extra track on your reference sequence called “Diff Expression, Sample condition, planktonic vs Squid-Associated”. Note that because we used a meta data field for assigning sequences, the track has automatically been named according to the values in this field.
DESeq2 uses the raw read count data for differential expression analysis. For a full description of the method, please refer to the DESeq2 website and paper. To view the R script that Geneious runs click the “R Script” link in the DESeq2 setup window.
Interpreting the results
Ensure you have saved your document, then switch to the Annotations table to view the differential expression results in tabular format. Under Track, choose to display only the Differential Expression track. You should see columns in the table for Differential Expression Log2 ratio, Expression Confidence, and p value. If you do not see these columns, add them by clicking the Columns button. The Log2 ratio column shows the fold change in expression between the sample condition A and sample condition B – if this value is positive, the gene is up-regulated in sample condition B (in this case, the squid associated samples), and if negative it is down-regulated. Click on the header for this column to sort the table by the log2 ratio. You will see that many of the top up-regulated genes are lux genes. These are within the lux operon, which plays a key role in controlling bioluminescence.
The adjusted p-value tells you whether the differential expression observed is statistically significant. This value is adjusted to account for multiple tests.
The Differential Expression Absolute Confidence is negative base 10 log of the adjusted p value, so is another way of expressing the significance. For example, a confidence value above 2 equates to a p value of less than 0.01. The Differential Expression Confidence column shows the same information, but the confidence values are adjusted to be positive for genes that are up-regulated in sample condition B compared to sample condition A, or negative for down-regulated genes. By default, Geneious uses this value to color the annotations: from blue for down-regulated genes, through to white for genes that are not differentially expressed, through to red for genes that are up-regulated.
Additional graphical tools for visualizing your results are available in tabs above the sequence viewer.
The PCA (Principal Component Analysis) Plot is a method for displaying the amount of variance in the data and can be used to check whether the replicates cluster together as a form of quality control. A “good” PCA plot should show that samples from the same sample condition cluster together and that the clusters should be reasonably well-separated along the X axis. In this example you’ll see that the Plk (planktonic) replicates are close together and are separated along the X axis from the Vnt (squid-associated) replicates. However one of the Vnt replicates, Vnt1, does not cluster with the other two. This may be the result of natural biological variation, but if you see this sort of pattern in your data you should also investigate the potential for experimental errors and possibly even discard the sample from the analysis.
In Geneious R11 onwards, results can also be displayed in a Volcano Plot. This shows the fold change (Log2 Ratio) plotted against the significance (-Log10 adjusted p value). The “outliers” on this graph represent the most highly differentially expressed loci. When you mouse over each dot on the graph you will see the gene name – click on the dot to display the gene name on the graph.
The dots on the Volcano plot are linked with both the annotations table and the annotations in the sequence viewer. Try switching back to the annotations table, sort it by Differential Expression Absolute Confidence and then select the top 10 genes. Now switch back to the Volcano Plot and you will see those genes are now labelled on the plot. The Volcano Plot also has several Advanced view options, as shown in the screenshot below. To access these, click the Advanced button above the viewer. Here you can add dotted lines showing expression ratio and p-value thresholds, and color dots that fall outside those lines.