Assembling Your DNA Sequences

Learn the basics with this introductory guide to assembling DNA sequences with de novo and map to reference.

Introduction

Advances in DNA sequencing have propelled genomic research in the last decade. In the early days of DNA sequencing, methods like Sanger sequencing allowed scientists to assess only one DNA fragment up to 1,000 bases in length at a time. Now, scientists can analyze billions of DNA fragments simultaneously with next-generation sequencing (NGS) technologies.

What is Sequence Assembly?

Because short-read NGS technologies generate sequences of <300 base pairs (bp) in length, it’s impossible to sequence an entire chromosome or genome in one continuous fragment. Therefore, NGS methods rely on fragmenting a large piece of DNA into smaller pieces, sequencing the fragments, and then putting them back together like a puzzle. This process is called sequence assembly, and there are two main methods: de novo assembly and map to reference assembly.

As NGS results provide millions to billions of reads, assembling NGS reads manually is highly impractical. Fortunately, many algorithms and tools, such as Geneious Prime, are available to do this. 

Sequence assembly has many applications, including:

  • Assembling genomes or genes to study their relationships for phylogenetic studies
  • Identifying variants associated with a disease or condition
  • Assembling whole genomes to understand the organism’s genes, pathways, and capabilities
  • Understanding microbial diversity, community structure, and the functional potential of a microbial community 

The journey of a DNA sample to assembled sequences encompasses five general steps: 

  1. Library preparation
  2. DNA sequencing
  3. Quality control and preprocessing of sequencing data
  4. Assembling the sequences using assembler software
  5. Assessing the assembly

Library Prep and Sequencing

The first step of NGS is library prep. During this stage, the DNA is fragmented, end-repaired, and attached to adapters before it binds to the flow cell for sequencing. Learn more about library prep and the sequencing process in Introduction to DNA Sequencing.

Quality Control and Preprocessing

Preprocessing (aided by various computational tools) trims reads and removes low-quality ones so that they are excluded from the assembly process. This step ensures that the reads used in assembly are high quality and don’t contain extraneous sequences that could interfere with assembly.

Preprocessing includes:

Quality check

Quality checks summarize the overall quality of the raw data and report on: (1) coverage, or the percentage of the sample sequenced, (2) sequencing depth, or the number of times a specific base has been sequenced across all of the reads, which helps determine how reliable a base call is, and (3) base quality (Phred quality score), which indicates how likely it is that a base call is correct.
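To make these three metrics concrete, the sketch below computes them for toy data. It assumes the common Phred+33 ASCII encoding of FASTQ quality strings and the standard relation Q = -10 x log10(P) between a Phred score and the error probability. The function names and the toy read positions are hypothetical and are not taken from Geneious Prime or any other tool.

```python
def phred_to_error_prob(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def depth_per_base(target_length, read_alignments):
    """Count how many reads cover each position of the target.

    read_alignments is a list of (start, length) pairs with 0-based starts.
    """
    depth = [0] * target_length
    for start, length in read_alignments:
        for pos in range(start, min(start + length, target_length)):
            depth[pos] += 1
    return depth


def coverage_percent(depth):
    """Percentage of positions covered by at least one read."""
    covered = sum(1 for d in depth if d > 0)
    return 100 * covered / len(depth)


# Hypothetical toy data: a 10-base target and three aligned reads
depth = depth_per_base(10, [(0, 4), (2, 4), (5, 3)])
print(depth)                    # [1, 1, 2, 2, 1, 2, 1, 1, 0, 0]
print(coverage_percent(depth))  # 80.0
print(decode_quality("I5+"))    # [40, 20, 10]
```

A Phred score of 30 corresponds to an error probability of about 1 in 1,000, which is why Q30 is often quoted as a quality benchmark.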

Filtering

Filtering removes reads that fail the quality check (e.g., low-quality reads).
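As an illustration, filtering can be sketched as keeping only reads whose mean Phred score meets a threshold. The threshold of 20, the Phred+33 encoding, and the function names are assumptions made for this example; real tools offer many other filtering criteria.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read (assumes Phred+33 encoding)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def filter_reads(reads, min_mean_quality=20):
    """Keep (sequence, quality_string) pairs that meet the quality threshold."""
    return [
        (seq, qual)
        for seq, qual in reads
        if qual and mean_quality(qual) >= min_mean_quality
    ]


# Hypothetical reads: mean qualities of 40, 0, and 20
reads = [("ACGT", "IIII"), ("ACGT", "!!!!"), ("GGCA", "5555")]
kept = filter_reads(reads)
print(kept)  # [('ACGT', 'IIII'), ('GGCA', '5555')]
```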

Read pairing

If using paired-end sequencing, reads corresponding to the same DNA fragment are paired and any overlapping paired reads can be merged together into a single read.
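The merging step can be sketched as follows. This is a minimal example that assumes the second read comes from the opposite strand, so it is reverse-complemented first, and that the overlap matches exactly. Real merging tools tolerate mismatches and weigh base qualities; the function names and the minimum overlap of 4 are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=4):
    """Merge a read pair if the end of read1 overlaps the start of the
    reverse-complemented read2. Returns the merged read, or None."""
    read2_rc = reverse_complement(read2)
    max_len = min(len(read1), len(read2_rc))
    # Try the longest possible overlap first
    for size in range(max_len, min_overlap - 1, -1):
        if read1[-size:] == read2_rc[:size]:
            return read1 + read2_rc[size:]
    return None


# Hypothetical 14-base fragment sequenced 9 bases from each end
fragment = "ACGTTGCAAGGCTA"
read1 = fragment[:9]
read2 = reverse_complement(fragment[-9:])
print(merge_pair(read1, read2))  # ACGTTGCAAGGCTA
```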

Trimming

Trimming removes sequences corresponding to adapters and barcodes, as well as leading and trailing bases with low quality scores.
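A minimal sketch of both kinds of trimming is shown below. It assumes exact adapter matches, Phred+33 quality strings, and a quality cutoff of 20; the adapter string and function names are illustrative only. Dedicated trimmers also handle partial and mismatched adapter matches.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx], qual[:idx]
    return seq, qual


def trim_low_quality_ends(seq, qual, min_quality=20, offset=33):
    """Remove leading and trailing bases whose Phred score is below the cutoff."""
    scores = [ord(ch) - offset for ch in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return seq[start:end], qual[start:end]


# Hypothetical read: 10 bases of insert followed by 10 bases of adapter
adapter = "AGATCGGAAG"
seq = "NACGTACGTT" + adapter
qual = "#IIIIIII5#" + "IIIIIIIIII"

seq, qual = trim_adapter(seq, qual, adapter)
seq, qual = trim_low_quality_ends(seq, qual)
print(seq, qual)  # ACGTACGT IIIIIII5
```

Removing the adapter first matters here: the low-quality base just before the adapter only becomes a trailing base once the adapter is gone.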

Best practice for preprocessing NGS reads in Geneious Prime

Assemble Using Assembler Software

Software, such as Geneious Prime, can help scientists assemble DNA using de novo assembly, map to reference assembly, or a combination of both methods. For example, reads assembled into longer sequences based on their overlaps, or contigs, can then be mapped to a reference sequence.

De Novo Assembly

De novo (Latin for “from the beginning”) sequence assembly occurs without a template sequence and assumes no prior knowledge of the source DNA sequence. De novo assembly is useful when assembling novel genomes or searching for unknown genes. This method can also give insight into repetitive regions of DNA and aid in identifying larger structural variations such as insertions, deletions, and translocations.

The de novo assembly process reconstructs sequences based on overlaps between different DNA fragments in the following steps (Figure 1):

1. Reads -> Contigs

Reads, typically 50-300 bases in length, are joined together based on the overlaps between them. This process continues until no more reads can be added to the continuous sequence. The resulting sequence, known as a contig, is typically a few hundred to thousands of bases long.
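The idea of joining reads by their overlaps can be sketched with a toy greedy algorithm: repeatedly merge the two sequences with the longest exact suffix-to-prefix overlap until no overlaps remain. This is only an illustration under simplifying assumptions (error-free reads, a single strand, a hypothetical minimum overlap of 3). Production assemblers use more sophisticated approaches, such as overlap-layout-consensus or de Bruijn graphs, and tolerate sequencing errors.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`,
    or 0 if it is shorter than min_overlap."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair repeatedly; return the remaining contigs."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no more reads can be joined
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]


# Hypothetical reads drawn from the sequence ATGGCGTACGTTAGC
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```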

2. Contigs -> Scaffolds

Contigs can then be assembled into scaffolds based on overlapping regions between contigs or based on paired-end reads across two different contigs. Unlike contigs, scaffolds can contain gaps of unknown sequences if a paired-end read spans two contigs. Scaffolds are typically a few hundred bases to megabases in length.
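To show how paired-end reads can link two contigs across a gap, the sketch below estimates the gap size from read pairs that span both contigs and joins the contigs with a run of N characters. It assumes that contig A lies before contig B on the same strand, that the fragment (insert) size of the library is known, and that each link records where its fragment starts in contig A and ends in contig B. All names and numbers are hypothetical.

```python
def estimate_gap(len_a, links, insert_size):
    """Estimate the gap between two contigs from spanning read pairs.

    Each link is (start_in_a, end_in_b): the fragment starts at 0-based
    position start_in_a in contig A and ends end_in_b bases into contig B.
    fragment length = (len_a - start_in_a) + gap + end_in_b
    """
    estimates = [
        insert_size - (len_a - start_a) - end_b for start_a, end_b in links
    ]
    return max(0, round(sum(estimates) / len(estimates)))


def build_scaffold(contig_a, contig_b, links, insert_size):
    """Join two contigs, filling the estimated gap with unknown bases (N)."""
    gap = estimate_gap(len(contig_a), links, insert_size)
    return contig_a + "N" * gap + contig_b


# Hypothetical contigs linked by two read pairs from a 12-base-insert library
scaffold = build_scaffold("ACGTACGTAC", "TTGGCCAATT", [(4, 3), (6, 5)], 12)
print(scaffold)  # ACGTACGTACNNNTTGGCCAATT
```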

3. Scaffolds -> Chromosomes

Then, overlapping scaffolds are assembled into chromosomes. Completing a genome assembly end-to-end is often challenging, as regions with repetitive sequences or low coverage may leave gaps. However, there are ways to fill these gaps using computational algorithms (e.g., RFfiller, Figbird, FGAP) or additional rounds of sequencing.

Figure 1. The de novo assembly process involves sequential steps of assembling reads into contigs, contigs into scaffolds, and scaffolds into a chromosome.

Pros and cons of de novo assembly

Pros

  • Doesn’t require a reference genome
  • Ideal for assembling sequences from unknown samples
  • Can help finish the genomes of known organisms
  • Can help identify structural variations such as insertions, deletions, and translocations

Cons

  • Requires higher-quality data (vs. mapping to a reference genome)
  • Requires more computational power than mapping to a reference genome

De novo assembly in Geneious Prime

De novo Assembly: Video series on performing de novo assembly and assembling circular contigs.

De novo Assembly Tutorial: Written tutorial on the de novo assembly of NGS data with example exercises.

Which de novo assembly algorithm is best for my data? Article summarizing pros and cons of various assemblers available in Geneious Prime.

Map to Reference Assembly

In contrast to de novo assembly, map to reference assembly aligns sequencing reads to an already assembled reference sequence (Figure 2). This method is also called reference-guided assembly. The first step is to select the reference sequence, which is typically the sequence most closely related to the genome of interest. After the raw reads are preprocessed, they are aligned to the reference sequence.

Understanding Sequence Alignment. 

Figure 2. In map to reference assembly, reads are mapped to a reference genome.
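The mapping step can be illustrated with a deliberately naive sketch: slide each read along the reference and keep the ungapped placement with the fewest mismatches. The function names and the mismatch limit of 1 are assumptions for this example. Production mappers typically rely on indexed, gapped alignment rather than this brute-force scan.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) for the best ungapped placement of the
    read on the reference, or None if no placement is within the limit."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = count_mismatches(read, window)
        if mismatches <= max_mismatches and (
            best is None or mismatches < best[1]
        ):
            best = (pos, mismatches)
    return best


# Hypothetical reference and reads
reference = "ATGGCGTACGTTAGCCTA"
print(map_read("GCGTAC", reference))  # (3, 0): exact match
print(map_read("GTTCGC", reference))  # (9, 1): one mismatch
print(map_read("CCCCCC", reference))  # None: unmapped
```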

By comparing reads to known genomes, map to reference assembly helps scientists identify single nucleotide polymorphisms (SNPs) and other small variations in sequences. This method is also the gold standard in diagnostics. Computational tools like Geneious Prime can help scientists identify variants such as SNPs, insertions, and deletions, as well as structural variations.
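As a rough illustration of SNP identification, the sketch below piles up mapped reads over the reference and reports positions where most reads disagree with the reference base. It assumes ungapped alignments that lie fully within the reference, and the thresholds (a minimum depth of 2 and a minimum allele fraction of 0.8) are hypothetical. Real variant callers also model base quality, mapping quality, and ploidy.

```python
from collections import Counter


def call_snps(reference, alignments, min_depth=2, min_fraction=0.8):
    """Report (position, reference_base, observed_base) for likely SNPs.

    alignments is a list of (position, read_sequence) pairs, where position
    is the 0-based start of the read on the reference.
    """
    pileup = [Counter() for _ in reference]
    for pos, read in alignments:
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


# Hypothetical reference and four mapped reads; three reads show T at position 5
reference = "ATGGCGTACGTT"
alignments = [(0, "ATGGCT"), (2, "GGCTTA"), (4, "CTTACG"), (6, "TACGTT")]
print(call_snps(reference, alignments))  # [(5, 'G', 'T')]
```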

Pros and cons of map to reference assembly

Pros

  • Assembly is more straightforward, accurate, and faster
  • More tools available

Cons

  • Can only be used if a reference genome is available
  • Alignment doesn’t necessarily match the original DNA sequence (e.g., there can be large variations in the DNA sequence of interest that won’t be identified when mapping to a reference genome).

Map to reference assembly in Geneious Prime

Map to Reference: Video series on how to use the map to reference tool in Geneious Prime.

Mapping and SNP Calling Tutorial: Written tutorial on how to perform a map to reference assembly and SNP calling.

Calling and Comparing SNPs: Video series on performing a reference assembly and calling SNPs in Geneious Prime.

Which map to reference assembly algorithm is best for my data? Article highlighting the pros and cons of various mappers.

Assess Assembly

Once the assembly is complete, how do you know if the assembly accurately represents the original DNA sample? Bioinformaticians use many metrics to assess the completeness, contiguity, and accuracy of an assembly. For example, the commonly used N50 metric gives an indication of the contiguity of the assembly. To calculate it, first sum the lengths of all contigs and sort the contigs from longest to shortest. Then add up the contig lengths, starting with the longest. The N50 is the length of the contig at which the running total reaches half of the total assembled length, so that at least 50% of the assembled sequence is contained in contigs of that length or larger (Figure 3). While a larger N50 indicates that the assembly produced larger contigs and is likely a better-quality assembly, a “good” N50 depends on many factors, including whether the sequencing data came from short-read or long-read sequencing.

Figure 3. Calculating the N50. The N50 is 45 in the example above, as over 50% of the total sequence (45 + 65 = 110) is contained in contigs that are at least 45 nucleotides long.
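The N50 calculation can be written in a few lines. The sketch below follows the common convention of returning the contig length at which the running total reaches at least half of the total assembly length. The contig lengths in the example are hypothetical and are not the values from Figure 3.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk through the contigs from longest to shortest
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0


# Hypothetical assembly: total length 290, so half is 145
lengths = [30, 80, 20, 70, 50, 40]
print(n50(lengths))  # 70, because 80 + 70 = 150 >= 145
```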

If the assembly does not meet the desired criteria, it can be redone using different parameters or different assembly tools. However, assembly can fail for reasons beyond the computational tools. For example, the nucleic acid extraction might not have produced samples that meet the quality requirements of the sequencing method. In other cases, some sequences are simply difficult to assemble regardless of the computational method because they contain many repeated regions.

Paired-end vs. Single-end

Paired-end and single-end sequencing are two approaches in NGS methodologies. As their names suggest, in single-end sequencing, only one end of the DNA is sequenced, while in paired-end sequencing, both ends of the DNA fragment are sequenced (Figure 4). Paired-end sequencing might not necessarily provide sequencing data for the entire length of the fragment, but it can help bridge the gaps between contigs as the distance is known between the paired reads.

Figure 4. A comparison between single-end sequencing and paired-end sequencing. While a single-end sequencing read only gives information about that read on its own, paired-end sequencing reads provide distance information between two reads even though the exact sequence in between the paired reads is unknown.

The choice between single-end and paired-end sequencing has many implications for sequence assembly. As the distance between the two reads is known, paired-end sequencing provides longer-range information that overlapping single-end reads cannot. For example, paired-end reads across two contigs can help assemble them into a scaffold. Paired-end sequencing also gives a more accurate alignment to reference sequences and can help determine the read’s orientation to the reference sequence. Paired-end sequencing is also helpful in deciphering genomic rearrangements and repetitive sequences.
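One way to see how the known distance and orientation of paired reads are used is to check whether a mapped pair looks as expected. In the sketch below, a pair is called concordant if the two reads face each other and the implied fragment length is close to the expected insert size; anything else is flagged as discordant, which can hint at a rearrangement or a mapping problem. The expected insert size of 300, the tolerance of 50, and the function name are hypothetical.

```python
def classify_pair(pos1, strand1, pos2, strand2, read_length,
                  expected_insert=300, tolerance=50):
    """Classify a mapped read pair as 'concordant' or 'discordant'.

    pos1 and pos2 are 0-based leftmost mapped positions on the reference;
    strand1 and strand2 are '+' or '-'.
    """
    # Order the two reads from left to right on the reference
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    facing_inward = left_strand == "+" and right_strand == "-"
    # Fragment length implied by the outer ends of the two reads
    insert = right_pos + read_length - left_pos
    if facing_inward and abs(insert - expected_insert) <= tolerance:
        return "concordant"
    return "discordant"


print(classify_pair(100, "+", 300, "-", read_length=100))   # concordant
print(classify_pair(100, "+", 5000, "-", read_length=100))  # discordant
print(classify_pair(100, "+", 300, "+", read_length=100))   # discordant
```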

While single-end sequencing libraries are easier to prepare, single-end sequencing makes it challenging to assemble repetitive DNA regions and doesn’t provide information across different contigs. As a result, assemblies built from single-end reads tend to have more gaps than assemblies built from paired-end reads.

Recommended Resources

A Brief Tour of Geneious Prime

Take a look at the Geneious Prime interface with this brief tour video.

Geneious Prime Features

Geneious Prime puts industry-leading bioinformatics and molecular biology tools directly into researchers' hands.

Geneious Prime Knowledge Base

The most commonly asked questions about Geneious Prime installation, licensing, functionality and more.

Get Started with Geneious Prime

Start Your 30 Day Free Trial of Geneious Prime.

Geneious Academy

Learn the basics of sequence alignment with this overview on the different methods used to align sequences.
Watch the series and learn how to perform a de novo assembly of short-read NGS data and assemble circular contigs.
Use the map to reference tool to map next-generation sequencing data against a reference, and assemble coronavirus genomes.
Use this practical exercise to perform a de novo assembly of short-read NGS data and assemble circular contigs.