Understanding Phylogenetics

Learn the basics of phylogenetics with this overview of phylogenetic trees, how they work, and how to build them.

What is Phylogenetics?

Phylogenetics (specifically, molecular phylogenetics) is the study of the evolutionary relatedness between organisms or other taxonomic groups based on analyses of DNA, RNA, or protein sequences. These analyses produce phylogenetic trees, which visually display the inferred relationships among organisms or taxa (Figure 1).

Groups of organisms, or taxa, that are closely related appear closer together in the tree (ex: Bacteroides and Thermotoga), while those that are distantly related are further apart (ex: Bacteroides and Flagellates). Longer branches indicate greater evolutionary distance or divergence (i.e. more change) from the branch point. Branches that split closer to the tree’s root are more ancient, or have diverged earlier in evolutionary history (ex: Microsporidia). Branches that split further from the root have diverged more recently and these taxa will be more closely related (ex: Plantae, Fungi, Ciliates).

Figure 1. Phylogenetic tree of life.

Phylogenetics has a long history dating back to before scientists knew about the structure of DNA or how to sequence DNA. In the early days of phylogenetics, scientists studied the relationship between organisms based on their physical characteristics. Now, scientists can compare the sequences of DNA, RNA, or protein to determine relatedness.

Phylogenetics helps scientists:

  • Understand the evolutionary history of species
  • Identify and classify species
  • Understand the evolution of a disease
  • Understand how viruses evolve (ex: for vaccine development)

Parts of a Phylogenetic Tree

While there are many types of phylogenetic trees, most share the same basic parts (Figure 2):

  • Topology: The topology of a tree is the overall branching pattern of a tree.
  • Branch: A branch represents a lineage - a group that originates from a common ancestor.
  • Tip: The tip at the end of a branch represents a species, gene, taxon, etc.
  • Node: The node represents the common ancestor of the lineages that branch from it.
  • Clade: The clade includes a common ancestor and all of its descendants.
  • Root: The root is the common ancestor to all species within the tree. 
  • Outgroup: The outgroup is a taxon that diverged before all other taxa in the tree and serves as a reference point for rooting it.

Figure 2. Different components of a phylogenetic tree.
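
The parts listed above map naturally onto a small data structure. The sketch below is one possible representation, not a standard one: a tip is a string, an internal node is a tuple of its children, and the outermost tuple is the root. The taxon names and the helper functions (tips, clades, to_newick) are made up for illustration.

```python
# A rooted tree as nested tuples: a string is a tip, a tuple is a node.
# Taxon names are illustrative only.
tree = ("Outgroup", (("A", "B"), ("C", "D")))


def tips(node):
    """Return the tip labels below a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    labels = []
    for child in node:
        labels.extend(tips(child))
    return labels


def clades(node):
    """Return one list of tips per node: an ancestor plus all of its descendants."""
    if isinstance(node, str):
        return []
    found = [tips(node)]
    for child in node:
        found.extend(clades(child))
    return found


def to_newick(node):
    """Write the topology (branching pattern only) as a Newick-style string."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


print(to_newick(tree) + ";")
for clade in clades(tree):
    print(clade)
```

The first clade printed is the whole tree (everything descended from the root), and the remaining clades are nested inside it. Branch lengths are left out here, so this sketch captures topology only.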

Types of Phylogenetic Trees

While all phylogenetic trees, which are also called dendrograms, show relatedness between taxa, different types of trees provide additional context for these relationships.

Cladogram

The length of the branch is arbitrary and does not represent time or evolutionary change.

Phylogram

The branch length represents the amount of evolutionary change, but does not provide any indication of time. These trees result in uneven tree tips.

Chronogram

The length of the branch represents time, but the length does not provide information on evolutionary change.

Rooted or Unrooted

Phylogenetic trees can also be rooted or unrooted. A rooted tree contains a common ancestor to all taxa within the tree, while an unrooted tree does not show a common ancestor (Figure 4). Rooted trees provide directionality and history to the taxa within the tree. In contrast, unrooted trees show evolutionary relationships but don’t indicate the direction of evolution. In the example below, the rooted tree shows the evolution of archaea before eukaryotes. However, you cannot infer whether archaea or eukaryotes originated first from the unrooted tree.

Figure 4. Rooted tree (left) and an unrooted tree (right).

Phylogenetic Analysis Step by Step

To construct a phylogenetic tree, scientists begin with sequences of interest and use multiple steps to arrive at a phylogenetic tree. Briefly, these steps include: (1) multiple sequence alignment, (2) constructing the tree using the selected tree building method, and (3) assessing the reliability of the tree.

Multiple Sequence Alignment

Aligning sequences before tree building is a crucial step because it allows researchers to know which positions or residues within one sequence correspond to the same positions or residues in another sequence. Since we cover sequence alignment in another article, we won’t discuss it in detail here.

Phylogenetic Tree Construction Methods

Various algorithms take into account different principles and evolutionary models to help phylogeneticists construct trees (Figure 5). While you can construct trees manually, the process is slow and tedious, so many programs, such as Geneious Prime, are available to build trees.

Figure 5. Types of tree construction methods.

Distance-based methods

Distance-based phylogenetic trees are based on the total number of evolutionary changes between pairs of sequences. These methods are best used when computational resources are limited, or for exploratory analysis of large datasets before conducting more intensive tree building using character-based methods (described below).

Starting from the alignment, these methods look at all possible pairs of the aligned sequences and count how many bases (or amino acids) differ between the two sequences in each pair. These pairwise differences are represented in a distance matrix that informs the order in which sequences are added to the tree one at a time.
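
As a rough illustration of the counting step described above, the sketch below builds a distance matrix from a toy alignment. It uses the simplest possible distance, the proportion of differing positions (often called the p-distance), and skips any column where either sequence has a gap. The sequences and function names are invented for this example, and real tools usually apply a substitution model on top of the raw counts.

```python
from itertools import combinations

# Toy alignment; names and sequences are made up for illustration.
alignment = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTTC",
    "seq3": "ACGAACGTTC",
    "seq4": "TCGAACCTTG",
}


def p_distance(a, b):
    """Proportion of aligned positions that differ, ignoring gapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aln):
    """Return a symmetric nested dict: dist[a][b] is the distance between a and b."""
    dist = {name: {name: 0.0} for name in aln}
    for a, b in combinations(aln, 2):
        dist[a][b] = dist[b][a] = p_distance(aln[a], aln[b])
    return dist


dist = distance_matrix(alignment)
for name, row in dist.items():
    print(name, [round(row[other], 2) for other in alignment])
```

In this toy data, seq1 and seq2 differ at one of ten positions, so they would be the first pair a distance-based method joins.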

Two distance-based methods are commonly used:

  • Unweighted Pair Group Method with Arithmetic Mean (UPGMA): UPGMA produces rooted trees and is based on the assumption that all sequences evolve at an equal rate. UPGMA uses the distance matrix to cluster the two sequences with the smallest pairwise distance together. Then, another distance matrix is created between this pair and all other sequences to identify the sequence that should be placed into the growing phylogenetic tree. This process repeats until all sequences are incorporated into the tree. This method is computationally efficient, but the assumption of a constant rate of evolution may not reflect actual events.
  • Neighbor joining (NJ): In contrast, NJ trees allow for unequal rates of evolution between sequences. These trees begin with a star tree where all taxa are joined through a single node. Then, nodes are sequentially added to cluster the two most closely related taxa together. This process repeats for the rest of the sequences in the tree. This method produces unrooted trees where branch lengths reflect the amount of change.
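
The UPGMA loop described above (join the closest pair, recompute distances, repeat) can be sketched in a few lines. This is a minimal teaching version under stated assumptions, not how any particular program implements it: distances are passed in as a plain dict, the tree is returned as nested tuples, ties are broken by input order, and the toy distances are invented so that the result is easy to check by hand.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa with UPGMA.

    names: list of taxon labels.
    distances: dict mapping a pair (a, b) to the pairwise distance.
    Returns (tree, heights): nested tuples plus the height of every node.
    """
    dist = {frozenset(pair): value for pair, value in distances.items()}
    sizes = {name: 1 for name in names}      # cluster -> number of tips it holds
    heights = {name: 0.0 for name in names}  # tips sit at height 0

    while len(sizes) > 1:
        # Find the two clusters separated by the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        # Equal rates assumed: the new node sits halfway between the clusters.
        heights[merged] = dist[frozenset((a, b))] / 2
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to each remaining cluster is the mean
        # over all tip pairs, i.e. a size-weighted mean of the old distances.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b

    (root,) = sizes
    return root, heights


# Invented distances between four taxa.
toy_distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree, heights = upgma(["A", "B", "C", "D"], toy_distances)
print(tree)
print(heights[tree])
```

With these distances, A and B are joined first, then C, then D, and the root sits at height 5 (half of the largest distance). Neighbor joining follows a similar loop but chooses pairs using a rate-corrected criterion and is not shown here.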

Pros and cons of distance-based methods

Pros

  • Quicker and less computationally intensive than character-based methods
  • Suitable for large datasets

Cons

  • Treats all types of genetic changes equally so it may not represent evolutionary processes well
  • Only one tree is proposed, in contrast to character-based methods, which evaluate multiple possible trees and select the best one

Character-based methods

Unlike distance-based methods, which use the total number of evolutionary changes, character-based methods compare all sequences by considering one “character” (such as the nucleotide or amino acid) in the alignment at a time. As character-based methods incorporate models that reflect different rates of evolution in types of sequence changes, these methods may be more accurate than distance-based methods. However, they are more computationally intensive. Because character-based methods examine changes in one character at a time, they can capture events such as convergent evolution and homoplasies. These methods construct multiple trees that are evaluated and ranked, resulting in selection of the best tree.

Character-based methods include the maximum parsimony and maximum likelihood models:

  • Maximum parsimony seeks to provide the simplest explanation for a phenomenon. In phylogenetics, this means building all possible trees and selecting the best tree based on the smallest number of evolutionary changes (ex: insertions, deletions, substitutions, etc.) required to explain the relatedness between sequences.
  • The maximum likelihood model constructs all possible trees and selects the tree that most likely predicts the relatedness between sequences based on a specific evolutionary model. These models take into account how likely different genetic changes are to occur over time. For example, maximum likelihood models can account for the probability of different types of substitutions (A to G, C to T, etc.), the frequency of transitions (purine replaced with purine, pyrimidine replaced with pyrimidine) vs. transversions (purine replaced with pyrimidine, pyrimidine replaced with purine), and the rate of substitutions among different sites within a DNA sequence.
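
To make the parsimony criterion concrete, the sketch below scores the three possible groupings of four taxa and keeps the one needing the fewest changes. It uses the Fitch algorithm, a common way to count the minimum number of substitutions at one alignment column on a given tree. This is an illustration under simplifying assumptions: only substitutions are counted (no insertions or deletions), trees are binary nested tuples, and the alignment is invented.

```python
# Toy alignment of four taxa; sequences are made up for illustration.
alignment = {"A": "ACGA", "B": "ACTA", "C": "GTGA", "D": "GTTA"}


def fitch(node, states):
    """Return (candidate ancestral states, minimum changes) for one site."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no extra change is needed here.
        return shared, left_cost + right_cost
    # The children disagree: one substitution must occur on a branch below.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, aln):
    """Sum the minimum number of changes over every column of the alignment."""
    length = len(next(iter(aln.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in aln.items()})[1]
        for i in range(length)
    )


# The three possible ways to group four taxa into two pairs.
candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {label: parsimony_score(t, alignment) for label, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

Here the first two columns favor grouping A with B, the third favors grouping A with C, and the fourth is constant, so ((A,B),(C,D)) wins with 4 changes against 5 and 6 for the alternatives. Scoring every tree like this is only feasible for a handful of taxa, because the number of possible trees grows very quickly as taxa are added.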

Pros and cons of character-based methods

Pros 

  • Takes into account evolutionary processes
  • More statistically rigorous than distance-based methods

Cons

  • More computationally intense than distance-based methods
  • Requires careful model selection as different models make different assumptions about the evolutionary process behind the data 
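
To show what an evolutionary model contributes, the sketch below applies maximum likelihood to the smallest possible problem: estimating the evolutionary distance between just two aligned sequences. It assumes the Jukes-Cantor (JC69) model, the simplest nucleotide model, in which every substitution is equally likely. The site counts are invented, and the grid search is a deliberately naive stand-in for the numerical optimizers that real programs use.

```python
import math


def jc69_log_likelihood(t, n_same, n_diff):
    """Log-likelihood of two aligned sequences under JC69.

    t is the distance in expected substitutions per site; n_same and n_diff
    are the numbers of identical and differing alignment columns.
    """
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    # 0.25 is the assumed equal frequency of the base in the first sequence.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


# Invented counts: 100 aligned sites, 20 of which differ.
n_same, n_diff = 80, 20

# Naive search: try distances from 0.001 to 2.000 and keep the most likely one.
grid = [i / 1000 for i in range(1, 2001)]
ml_distance = max(grid, key=lambda t: jc69_log_likelihood(t, n_same, n_diff))

# JC69 also has a closed-form answer, which the search should reproduce.
p = n_diff / (n_same + n_diff)
closed_form = -0.75 * math.log(1 - 4 * p / 3)

print("observed proportion of differences:", p)
print("maximum likelihood distance:", ml_distance)
print("closed-form JC69 distance:", round(closed_form, 4))
```

The estimated distance comes out larger than the observed proportion of differences because the model accounts for sites that changed more than once. A different model (ex: one that treats transitions and transversions separately) would give a different estimate from the same data, which is why model selection matters.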

Assessing the Phylogenetic Tree

Once you’ve built the phylogenetic tree, you next need to determine if the tree is reliable and whether it is significantly better than other possible trees. To do this, phylogeneticists use the statistical technique called resampling. In resampling, a new data set is created from the original data and run through the same algorithms as the original data.  

Bootstrapping is a widely used method of resampling. In bootstrapping, resampling means sampling random sites from the multiple sequence alignment (Figure 6) to generate a sequence alignment of the same length as the original alignment. The same site can be sampled more than once (i.e. sampling with replacement). A less computationally demanding alternative to bootstrapping is jackknifing, where original sites (ex: columns in the alignment) are systematically deleted.

Figure 6. Resampling by bootstrapping.

In either resampling method, the resampled data passes through the same algorithms as the original alignment to create a tree. This process is repeated many times (commonly hundreds to thousands of replicates) and a consensus tree is built from all resampled trees. In such a consensus tree, each branch is assigned a bootstrap support value. This value represents the proportion of bootstrapped trees that support a particular branch in the consensus tree and is given as a percentage. For example, a branch with a bootstrap value of 95 means that 95% of the bootstrapped trees contain that specific branch, giving confidence that this branch in the consensus tree is reliable.
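
The resampling step can be sketched as follows. To keep the example short, full tree building is replaced by a much simpler question: which two sequences are closest? Support is then the percentage of bootstrap replicates that recover the same closest pair as the original alignment. The alignment, the function names, and this closest-pair shortcut are all assumptions made for illustration. A real analysis would rebuild the whole tree for every replicate and score every branch.

```python
import random
from itertools import combinations

# Toy alignment; names and sequences are made up for illustration.
alignment = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGTACGTACGA",
    "C": "ACGAACGAACGTTCGTACGA",
    "D": "TCGAACGATCGTTCGTTCGA",
}


def resample_columns(aln, rng):
    """Draw columns with replacement to build an alignment of the same length."""
    length = len(next(iter(aln.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in aln.items()}


def closest_pair(aln):
    """Stand-in for tree building: the pair of sequences with the fewest differences."""
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(aln[a], aln[b]))
    # Ties go to the first pair in alphabetical order.
    return min(combinations(sorted(aln), 2), key=differences)


def bootstrap_support(aln, replicates=200, seed=0):
    """Percentage of replicates that recover the original closest pair."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    reference = closest_pair(aln)
    hits = sum(
        closest_pair(resample_columns(aln, rng)) == reference
        for _ in range(replicates)
    )
    return reference, 100.0 * hits / replicates


reference, support = bootstrap_support(alignment)
print("grouping:", reference, "support:", support)
```

Because columns are drawn with replacement, some original sites appear several times in a replicate and others not at all, which is what lets the replicates disagree with each other. A grouping backed by many sites survives most replicates, while one that depends on a single site is often lost.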

Phylogenetics Resources in Geneious Prime

The neighbor-joining and UPGMA methods are built into Geneious Prime, while character-based methods are available as plug-ins. Find more information about building phylogenetic trees in the tutorials and articles below:

Phylogenetic Trees: Video series on building and manipulating trees in Geneious Prime.

Building Phylogenetic Trees Tutorial: Written tutorial with practice exercises to align sequences and build, view, and manipulate trees in Geneious Prime.

How to build a phylogenetic tree in Geneious Prime: Knowledge Base article on building trees in Geneious Prime.

Recommended Resources

A Brief Tour of Geneious Prime

Take a look at the Geneious Prime interface with this brief tour video.

Geneious Prime Features

Geneious Prime puts industry-leading bioinformatics and molecular biology tools directly into researchers' hands.

Geneious Prime Knowledge Base

The most commonly asked questions about Geneious Prime installation, licensing, functionality and more.

Get Started with Geneious Prime

Start Your 30 Day Free Trial of Geneious Prime.
