Expression Analysis Tutorial
Learn to calculate normalized expression measures from RNA-Seq data. You will measure RPKM, FPKM and TPM on datasets from two different sample conditions then calculate differential expression between the two samples.
In this tutorial you will learn to calculate normalized expression measures from RNA-Seq data using the Geneious expression analysis tool.
Two datasets, each from a different sample condition are provided, and you will measure RPKM, FPKM and TPM on each dataset then calculate differential expression between the two samples.
For a description of these expression level metrics, please click here.
To complete the tutorial yourself with included sequence data, download the tutorial and install it by dragging and dropping the zip file into Geneious Prime. Do not unzip the tutorial.
Calculating expression levels
Measuring differential expression
Expression level measures
Exercise 1: Calculating Expression Levels
In this tutorial, you have been provided with 2 sets of reads, each from a different experimental condition, plus a reference sequence. We first need to map each set of reads to the reference sequence.
To do this, select the reference sequence (108885074), and (holding down cntrl/command) both sets of reads. Click Align/Assemble → Map to Reference. Select Reset to defaults under the Settings cog down the bottom left of the screen, and check that the 108885074 sequence is displaying in the Reference Sequence setting. Because we want to create separate assemblies for each sample condition, click Assemble each sequence list separately. Uncheck Save in subfolder and leave the rest of the settings as they are, then click OK to run the assemblies. It may take a few minutes for the assemblies to complete.
Now we will calculate RPKM, FPKM and TPM for each assembly. A description of each expression level metric is given here.
Select the first assembly (“Sample_condition_1 assembled to 108885074”), then go to Annotate and Predict → Calculate Expression Levels. Leave the default settings as they are, with Ambiguously mapped reads: Count as partial matches and Annotation type: CDS, and click OK.
You should now see a new track created on the reference in the assembly called “Expression: Sample_condition_1”. Click Save and choose “yes” when asked if you want to apply this to the original sequence – this will load the annotation track onto the Reference sequence document.
*Note: if you don’t see the prompt asking if you want to apply changes to the original sequence, you will need to reset your preferences. To do this, go to Tools → Preferences and under the Appearance and Behavior tab click “Reset Questions”.
Now repeat the analysis on the second assembly and save the results as before.
**Note that because these assemblies represent two different experiments, Calculate Expression Levels must be run separately on each one. If you select both assemblies and run the analysis on both at once, Geneious will assume that these are a single sample and your calculations will be incorrect. However, if you have multiple reference sequences for each sample (e.g. one set of reads mapped to multiple chromosomes), then all contigs from that sample should be selected and run in a single step.
You should now see two Expression annotation tracks (one for each sample condition) loaded on the reference sequence and both assemblies. If you cannot see them, check that both tracks are enabled in the Annotation and Tracks tab to the right of the sequence viewer.
Select the first assembly again and take a closer look at the Expression annotation track. By default, annotations are colored based on the TPM property, ranging from blue for 0, through to white for the mean TPM, up to red for the highest TPM for any gene in the sample. If you wish to color by a different property, select the down arrow next to the annotation track name and choose Color by / Heat map and choose the field you want.
Mouse over the track name, and you’ll see the total read, transcript and fragment counts and the Min/Mean/Max RPKM, FPKM and TPM for that assembly.
Now select any one of the annotations on the track and zoom in on it using the zoom arrows. Mouse over the annotation and you’ll see a popup window containing the values for RPKM, FPKM and TPM, as well as the raw read counts for that CDS.
We can display these values in a table as follows: Click the Annotations tab above the sequence viewer then click the Track button and choose the Expression: Sample_condition_1 track to display. If you cannot see the columns for RPKM, FPKM and TPM, click the Columns button and select these columns, as in the table below.
You can export this table in .csv format if you wish by clicking Export table (this may be under the double arrows >>).
Exercise 2: Measuring differential expression
Select the reference sequence again, and you should now see Expression level annotation tracks for each sample condition loaded on it. If you can’t see these tracks, check that they are enabled in the Annotation and Tracks tab to the right of the sequence viewer. To find genes that are differentially expressed between the two sample conditions go to Annotate and Predict → Compare Expression Levels.
Check that the tracks you created in the previous exercise are chosen for comparison: they should be Track 1 – Expression: Sample_condition_1 and Track 2 – Expression: Sample_condition_2. Choose to compare Transcripts normalized by Median of Gene Expression Ratios – this is the recommended method (see expression level measures for more information on these methods). Click OK.
You’ll now see a third annotation track added to your reference sequence called “Diff Expression: Sample_condition_1 vs Sample_condition_2”, and as with the individual expression level tracks, annotations are colored according to the results. Mouse over one of the annotations and you’ll see a popup window listing the raw read and transcript counts for each sample condition, plus a list of differential expression scores.
The Differential Expression p value tells you whether the differential expression observed is statistically significant. The Differential Expression Confidence is the negative base 10 log of the p value, adjusted to be negative for genes that are under-expressed in sample 2 compared to sample 1, or positive for over-expressed genes. By default, Geneious uses this value to color the annotations: from blue for under-expressed genes, through to white for genes that are not differentially expressed, through to red for genes that are over-expressed.
Try filtering based on this field to find all the genes that show significantly higher expression in sample condition 2 than sample condition 1: Display the Annotations and Tracks tab to the right of the sequence viewer and type “Differential Expression Confidence”>2 (including the quotation marks) in the Search box, as in the screenshot below. A Differential Expression Confidence level above 2 equates to a p value of less than 0.01. You should now see only 29 annotations displayed, all of which are pink or red in color, indicating significantly higher expression of that gene in sample condition 2 than sample condition 1 (at p<0.01). To scroll through these annotations, click the arrows to the right of the annotation track name in the Annotations and Tracks tab.
You can also display the results in tabular form by clicking the Annotations tab above the sequence viewer and displaying the “Diff Expression” Annotation track only. You may need to add the columns for Differential Expression Confidence and Ratio by clicking the Columns button. To sort the results in order of most over-expressed in sample 2 → most under-expressed in sample 2 click the “Differential Expression Confidence” header twice so that the highest values are at the top.
You can export this table in .csv format if you wish by clicking Export table (this may be under the double arrows >>).
Expression Level Measures
Expression measures need to be normalized to remove biases that can be introduced during sequencing, such as sequencing depth and length of the RNA transcript. Geneious calculates three expression level measures on individual samples, which are normalized so that genes expressed in the same sample can be compared:
Reads per kilobase per million normalizes the raw count by transcript length and sequencing depth.
RPKM = (CDS read count * 10^9) / (CDS length * total mapped read count)
Same as RPKM except if the data is paired then only one of the mates is counted, ie. fragments are counted rather than reads.
Transcripts per million (as proposed by Wagner et al 2012) is a modification of RPKM designed to be consistent across samples. It is normalized by total transcript count instead of read count in addition to average read length.
TPM = (CDS read count * mean read length * 10^6) / (CDS length * total transcript count)
The above metrics are calculated by normalizing the count of reads that map to each CDS annotation. If a read at least partially intersects at least one interval from a CDS annotation, then it will be treated as though that read mapped to that CDS annotation. For reads that map to multiple locations, or reads that map to a location that intersect multiple CDS annotations, these may either be counted as partial matches, excluded from the calculations, or counted as full matches to each location they map to. We recommend counting reads as partial matches, for example if a read maps to two locations, then it will be counted as if 0.5 reads mapped to each of the two locations. When calculating statistics, reads that don’t map or map outside of a CDS annotation are ignored.
Differential expression measures
For comparing across samples additional normalization is required, as different samples may contain different quantities of transcript. The choice of normalization method determines the Differential Expression Ratio for each gene. The following normalization methods are available in Geneious:
- Total Count: The counts in each gene are scaled according to the total number of transcripts mapped to all genes. For example, if one sample has twice as many transcripts mapped as the other sample, then the counts for each gene need to be halved to make them comparable with the other sample.
- Median Expression: The expression level of all expressed genes from the sample are calculated and the median values of these from each sample are used to normalize. For example, if one sample has a median twice as high as the other sample, then the counts for each gene need to be halved to make them comparable with the other sample.
- Total Count Excluding Upper Quartile: The expression level of all expressed genes from the sample are calculated and the total number of reads, fragments, or transcripts from the lowest 75% of those are totaled. Values are normalized between samples based on this total.
- Median of Gene Expression Ratios: For each gene the ratio of the expression level between samples is calculated. Then the median ratio across all expressed genes is used as the normalization scale. This normalization method is the same as that implemented by DESeq.
All of these normalization methods (and more) are described and compared by Dillies et al 2012, and they recommend using Median of Gene Expression Ratios rather than the other three normalization methods implemented here. One reason for this is that a few highly expressed genes can greatly affect the total number of transcripts produced, so this can distort the fraction of the total reads that contribute to genes with lower expression.
Values to Compare
Either read counts, fragment counts, or transcript counts from each annotation can be compared. Since a single transcript can produce multiple reads and fragments, the number of reads and fragments produced aren’t independent events so the confidence values produced by comparing these are unlikely to be accurate. For this reason we recommend comparing samples using transcript counts.
In addition to calculating the differential expression ratio, it is useful to know whether or not that differential expression is statistically significant. This is represented by a p-value. A number of advanced methods have been published for the calculation of p-values based on a range of assumptions. Many of these are compared by Soneson & Delorenzi 2013 and they conclude that no single method is optimal under all circumstances and that very small samples sizes impose problems for all evaluated methods.
In this basic differential expression plugin in Geneious we have implemented a simple statistical test based on the assumption that the gene which each observed transcript came from is an independent event.
For a given gene, the probability that a randomly selected transcript would come from that gene is calculated as number of transcripts mapped to that gene/total number of transcripts from that sample. This probability is normalized, the mean probability between the two samples calculated, and this mean un-normalized for each sample. This produces an expected probability that a randomly selected transcript from this sample comes from that gene, assuming that this gene is not differentially expressed.
The Binomial Distribution is used to calculate the probability that an observed count at least as extreme as the observed one would be seen, assuming this non-differentially expressed mean probability. The probabilities from each sample are multiplied together to form the p-value.