Building Phylogenetic Trees

In this tutorial you will learn how to align sequences and build a phylogenetic tree, using HIV sequences as an example. You will learn how to view and manipulate the tree to answer questions on the origins of HIV-1.

DOWNLOAD PHYLOGENETIC TREES TUTORIAL

Tutorial Instructions

Geneious Prime tutorials are installed by either 'Dragging and dropping' the zip file into Geneious Prime or using File → Import → From File... in the Geneious Prime menu. Do not unzip the tutorial.

Created from tutorials developed by Dr Jack da Silva, University of Adelaide, Adelaide, Australia; Dr Howard Ross, Bioinformatics Institute University of Auckland, Auckland, New Zealand; and the Biomatters team.

Phylogenetic Trees Tutorial

Investigate the evolutionary origins of HIV

Note: To complete the tutorial with the referenced data please download the tutorial above and install in Geneious Prime.

In this tutorial, you will use Geneious Prime to investigate the evolutionary origins of human immunodeficiency viruses (HIVs) using molecular phylogenetic tools. You will learn how to align sequences and build a phylogenetic tree, as well as how to view and manipulate the tree to answer questions on the origins of HIV-1.

Introduction: Human and Simian Immunodeficiency Viruses
Exercise 1: Multiple Alignment of HIV and SIV sequences
Exercise 2: Build a Phylogeny of HIVs and SIVs
Exercise 3: Molecular Phylogenetics of HIVs and SIVs
Exercise 4: The Origin of the HIV-1 Pandemic

Introduction: Human and Simian Immunodeficiency Viruses

HIVs, the causes of acquired immune deficiency syndrome (AIDS), are closely related to simian (monkey and ape) immunodeficiency viruses (SIVs). These and other similar viruses are retroviruses. Retroviruses are characterised by their RNA genomes, which, once inside a host cell, are reverse transcribed into DNA and then integrated into the host cell’s genome. The integrated viral genome is known as a provirus. You will be working with proviral DNA sequences.

The origins of HIVs were mysterious when these viruses were first discovered in the early 1980s. There are two types of HIVs. HIV type 1 (HIV-1) is more widespread and causes more severe disease than HIV type 2 (HIV-2). HIV-1 is also far more diverse than HIV-2. HIV-1 is classified into three major groups: M, N, and O. The viruses causing the AIDS pandemic (widespread epidemic) belong to Group M. Group M is subdivided into several subtypes. You will be analysing sequences from HIV-1 Group M Subtypes A, B, C, D, F, G, H, J, K. The HIV-1 viruses infecting people in North America, Europe and Australia are mostly from Group M Subtype B. All groups and subtypes of HIV-1 and HIV-2 are found in Africa.

Both HIV-1 and HIV-2 are closely related to SIVs found in a variety of African primate species. This led researchers to hypothesise early on that HIVs had jumped to humans from one or more African primate species. It was suggested that close contact between humans and monkeys that were kept as pets or hunted for food had allowed the SIVs to jump hosts.

More information on HIV can be found on this Wikipedia page.

In this tutorial you will use molecular phylogenetics to determine the evolutionary relationships of HIVs and SIVs, and so determine from which African primates HIVs originated. In Exercise 1 you will build an alignment of the HIV and SIV sequences, then in Exercise 2 you will learn to build a basic phylogenetic tree. Exercises 3 and 4 provide questions and answers to further your understanding of how to interpret phylogenetic trees.

SIV sequences and primate taxa

The sequences in this tutorial come from various African primate species known to be infected with different SIVs. There are also three non-African species, all from Asia, that have been infected with SIVs in captivity: the pig-tailed macaque, the rhesus macaque and the stump-tailed macaque. The SIVs from all of these primate species are referred to by the three-letter code given with each picture. For example, the SIV from the sooty mangabey is called SIVSMM and the sequence in the alignment or tree is labelled SIV-SMM.

Mona monkey
Cercopithecus mona mona [denti]
MON [DEN]

de Brazza’s monkey
Cercopithecus neglectus
DEB

Tantalus monkey
Chlorocebus tantalus
TAN

Syke’s monkey
Cercopithecus albogularis
SYK

Greater spot-nosed monkey
Cercopithecus nictitans
GSN

Green monkey
Chlorocebus sabaeus
SAB

Mustached guenon
Cercopithecus cephus
MUS

Vervet monkey
Chlorocebus pygerythrus
VER

Grivet
Chlorocebus aethiops
GRV

L’Hoest’s monkey
Cercopithecus lhoesti
LST

Sooty mangabey
Cercocebus atys
SMM

Red-capped mangabey
Cercocebus torquatus
RCM

Sun-tailed monkey
Cercopithecus solatus
SUN

Mandrill
Mandrillus sphinx
MND

Drill
Mandrillus leucophaeus
DRL

Pig-tailed macaque
Macaca nemestrina
MNE

Stump-tailed macaque
Macaca arctoides
STM

Rhesus macaque
Macaca mulatta
MAC

Common chimpanzee
Pan troglodytes
CPZ

Exercise 1: Multiple alignment of HIV and SIV sequences

Before a phylogeny can be constructed, the sequences must be aligned. The objective of sequence alignment is to maximize the similarity between sequences, inserting gaps in sequences where necessary to improve the overall alignment.

Multiple alignment algorithms use a scoring system where sequence matches and mismatches at each site are assigned a value, and gaps are penalized. Inserting gaps in an alignment can increase the similarity of the surrounding bases, so the overall alignment score is a trade-off between the improved match/mismatch scores and the cost of opening and extending a gap.
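The trade-off described above can be illustrated with a minimal Python sketch. The function name and the scoring values (match, mismatch, gap opening and gap extension) are assumptions chosen for illustration; they are not the parameters used by MUSCLE or any other aligner in Geneious.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned sequences with a simple affine gap penalty.

    All scoring values are illustrative assumptions, not aligner defaults.
    Simplification: consecutive gap columns count as one gap run, even if
    the gap switches from one sequence to the other.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            # The first gap column pays the opening cost, later ones the extension cost
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# The same two sequences aligned with an internal gap and with the gap pushed to the end
with_internal_gap = alignment_score("ACGTACGT", "ACG-ACGT")
with_end_gap = alignment_score("ACGTACGT", "ACGACGT-")
print(with_internal_gap, with_end_gap)
```

With these assumed values, placing the gap inside the sequence lines up the remaining bases, so the matches gained outweigh the cost of opening the gap.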

In this exercise you will construct an alignment of 62 env sequences of HIV-1, HIV-2, and various SIVs. The SIV sequences come from various African and non-African primate species.

The env gene is found in all retroviruses. It codes for two viral envelope glycoproteins that are positioned on the virion surface and interact with host cell-surface receptors.

Click on ‘HIV_sequences’ to view the sequences.

The sequences are labelled in the format: virus type; followed by the common name of the primate species for the SIV sequences, or the group or subtype for HIV-1 and HIV-2 sequences; finally followed by the accession number.

To align these sequences, go to Align/Assemble → Multiple Align. Geneious has three alignment programs built in (Geneious aligner, MUSCLE, and Clustal Omega), and a plugin is available for the MAFFT aligner. For further information on these aligners please see this article. We will use the MUSCLE aligner for this example, as it is suitable for a medium-sized dataset.

Select MUSCLE alignment from the alignment options. We will use the default parameters, so click on the settings cog in the bottom left of the window and choose Reset to defaults (if it is greyed out, the default parameters are already set). Click the More Options button to view the parameters if you wish. Click OK to start the alignment – it may take several minutes to complete.

Once the alignment has completed, click on it to view it and zoom in to see the bases. Note that there are many large gaps, which is characteristic of an alignment of a rapidly evolving gene in divergent species.

Exercise 2: Build a Phylogeny of HIVs and SIVs

In this exercise you will construct a phylogeny using the Neighbour-Joining tree building method and the Tamura-Nei model. Models of evolution describe expected frequencies of each nucleotide and the rate of change between nucleotides. The Tamura-Nei model assumes each base has a different equilibrium frequency and allows transitions and transversions to occur at different rates. It allows the two types of transitions (A ↔ G and C ↔ T) to have different rates. This is useful when analysing HIV sequences because HIV exhibits hyper G-to-A mutation caused by a host enzyme (APOBEC3G). You will use the Neighbour-Joining method because these sequences do not, in general, evolve in a clock-like manner.
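To make the three kinds of change that the Tamura-Nei model treats separately more concrete, the following Python sketch counts them between two aligned sequences. It only classifies observed differences; it does not compute the Tamura-Nei distance itself, and the function name and example sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def classify_differences(seq_a, seq_b):
    """Count purine transitions, pyrimidine transitions and transversions.

    Columns containing a gap or an ambiguous base are skipped.
    """
    counts = {
        "compared_sites": 0,
        "purine_transitions": 0,      # A <-> G
        "pyrimidine_transitions": 0,  # C <-> T
        "transversions": 0,           # purine <-> pyrimidine
    }
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue
        counts["compared_sites"] += 1
        if a == b:
            continue
        if a in PURINES and b in PURINES:
            counts["purine_transitions"] += 1
        elif a in PYRIMIDINES and b in PYRIMIDINES:
            counts["pyrimidine_transitions"] += 1
        else:
            counts["transversions"] += 1
    return counts


# Made-up aligned sequences, not taken from the tutorial data
example_counts = classify_differences("AGCTAG-T", "GGTTCGAT")
print(example_counts)
```

A model that gives each of these three classes its own rate can accommodate an excess of G-to-A changes without inflating the estimated rate of the other substitution types.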

Select the alignment you created in Exercise 1.

To construct a Neighbour-Joining tree using the Tamura-Nei model, with bootstrapping, click the Tree button and select the Geneious Tree Builder. Check that the default parameters are initially set by clicking Reset to Defaults.

For the genetic distance model select Tamura-Nei and for the tree build method select Neighbor-Joining. Set the outgroup to “SIV-MON; Mona monkey; AY340701”. This sequence will be used to root the tree.

To calculate support values for the tree use bootstrapping. To do this, tick the box next to Resample tree and select Bootstrap in the dropdown box next to resampling method. Set number of replicates to 100 and the support threshold to 0.

The tree building options should now look similar to this:

Click OK to build the tree.

Once the tree builder completes, the tree document will appear in the document table in Geneious and should open automatically.

Viewing and Manipulating Phylogenetic Trees

A phylogenetic tree is a branching diagram of evolutionary relationships. It contains information about the order of evolutionary divergences within, and hence about the relationships among, a group of organisms. It can also contain information about the amount of evolutionary change which occurred between any two branching events. The lines on the tree are called branches and the intersections of these lines are called nodes. A node represents a branching event in the tree. The branching pattern of a tree is called its topology. The topology shows how organisms are related to one another.

Depending on the size of your screen and the size of the tree, it may not be physically possible to display all of the sequence names on the tree, so Geneious will only display some of the sequence names. To zoom in on the tree, use the Zoom slider under “General” in the panel on the right hand side of the tree view. To expand the distance between the branches of the tree, use the Expansion slider. As the amount of space between the branches increases, more sequence names will be displayed on the tree.

As this tree was created using an alignment in Geneious, the alignment is attached to the tree. Click on the “Alignment View” tab to view the alignment.

The sequences in the alignment are sorted according to the topology of the tree. On the left hand side of the sequence names, you can see the tree topology (this may not be visible if you are working with large trees). Select the “SIV-MON; Mona monkey; AY340701” sequence in the alignment then return to the “Tree View”. This sequence is now selected in the tree as well.

The sequences used to build this alignment and tree have additional meta-data associated with them (this is the data found in the “Properties” field in the “Info” tab in the individual sequence documents). This information can be displayed on the tips of the trees. To display the organism on the tips of the tree, select “Organism” from the box next to “Display” under “Show Tip Labels”.

To display the organism and host organism, hold Ctrl (on Windows) or Cmd (on Macs) and select “Organism” and “Host Organism”. Now the host organism and organism are displayed on the tips of the tree, separated by a comma. To display the sequence names on the tree, select “Names”.

Just as a sentence can be printed using different fonts, or colors of ink, without any change in meaning, so too can trees be represented in different shapes and orientations. The information encoded in the tree remains unchanged, even as the appearance changes. For example, the appearance of the tree can be changed by rotating groups of branches. To rotate the branches, select an internal node in the tree and click the Swap Siblings button at the top of the window. This will rotate the branches in that subtree; however, the degree of relatedness is not altered by rotating branches in a tree. Simply having two names close together in a tree does not imply any close relationship.

Try this with the tree you have created. Select the node in the tree containing the Grivet monkey and the four Vervet monkeys and click the Swap Siblings button.

The order of these samples will change in the tree, but the relationship between the sample from the Grivet monkey and those from the four Vervet monkeys has not changed.
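The same point can be checked outside Geneious with a small Python sketch. The tree is written as nested tuples, and the tip labels are hypothetical stand-ins rather than the tutorial's sequence names. Swapping the children of a node changes the order in which the tips are drawn, but the set of clades stays the same.

```python
def tips(node):
    """Return the set of tip names below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tips(child) for child in node))


def tip_order(node):
    """Return tip names in the order they would be drawn."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in tip_order(child)]


def clades(node):
    """Return every clade (as a set of tip names) defined by the tree."""
    if isinstance(node, str):
        return set()
    found = {tips(node)}
    for child in node:
        found |= clades(child)
    return found


def swap_siblings(node):
    """Reverse the order of the children of one internal node."""
    return tuple(reversed(node))


# Hypothetical tree: a grivet grouped with two vervets, plus a tantalus monkey
tree = (("GRV", ("VER1", "VER2")), "TAN")
swapped = (swap_siblings(tree[0]), tree[1])

print(tip_order(tree))
print(tip_order(swapped))
print(clades(tree) == clades(swapped))
```

The drawing order differs between the two trees, while the comparison of clade sets shows that the relationships are identical.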

Rooted Trees

Trees may be unrooted or rooted. To view the HIV tree as an unrooted tree, click one of the unrooted views under the “General” options in the panel on the right hand side of the tree view.

Unrooted trees do not tell us much about evolutionary relationships. We cannot tell which node is the ancestor and which are the descendant nodes on the tree. To establish ancestor-descendant relationships we need to identify a suitable outgroup and then root the tree on the branch separating the outgroup from the remainder of the tree (the ingroup). We can specify the root before building the tree to produce a rooted tree, or we can specify the root after the tree is built to change an unrooted tree to a rooted tree.

When you built the tree of HIV and SIV sequences you specified an outgroup (“SIV-MON; Mona monkey; AY340701”) so Geneious has produced a rooted tree. To view the tree as a rooted tree, click the rooted view under the “General” options in the panel on the right hand side of the tree view.

Rooted phylogenetic trees may be oriented horizontally, as above, or vertically. Here the time axis is implicit, running from left to right. The node at the left end of the tree is the root node, which represents the oldest point on the tree. As we move from the root node, we can identify nodes which are ancestral to their descendent clades. Working in from the tips of the tree enables us to identify close and distant relatives. The degree of relatedness of any two organisms is given by how far back on a rooted tree you must go to find their common ancestor. If, in tracing back to the common ancestor of A and B, you pass the common ancestor of A and C, then you can say that A and C are more closely related than A and B.
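The rule for comparing relatedness can be sketched in Python for a tiny rooted tree. The tree is stored as a hypothetical child-to-parent map with tips A, B and C, and the helper names are assumptions. The sketch counts how many nodes back from A you must go to reach each common ancestor, which is a purely topological measure and ignores branch lengths.

```python
# Rooted tree ((A, C), B) stored as child -> parent; the root has no parent
parent = {"A": "n1", "C": "n1", "n1": "root", "B": "root", "root": None}


def path_to_root(node, parent):
    """List the node and all of its ancestors, ending at the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def mrca(x, y, parent):
    """Most recent common ancestor: the first ancestor of y that is also an ancestor of x."""
    ancestors_of_x = set(path_to_root(x, parent))
    for node in path_to_root(y, parent):
        if node in ancestors_of_x:
            return node
    raise ValueError("nodes are not in the same tree")


def steps_back(x, y, parent):
    """Number of nodes passed when walking from x back to its common ancestor with y."""
    return path_to_root(x, parent).index(mrca(x, y, parent))


print(mrca("A", "C", parent), steps_back("A", "C", parent))
print(mrca("A", "B", parent), steps_back("A", "B", parent))
```

Walking back from A, the common ancestor with C is reached before the common ancestor with B, so A and C are the more closely related pair in this toy tree.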

On a rooted tree, each node and all of its descendent nodes form a clade. This is what we would commonly refer to as a “branch” on a real tree – the physical branch and all the little branches and leaves attached to it. Because an unrooted tree lacks the time axis described above, it is inappropriate to discuss clades in that context.

Phylograms and cladograms

The lengths of the branches of a tree may be arbitrary (e.g. cladogram) or can represent the amount of evolutionary change (phylogram).

In a phylogram, the lengths of the branches are proportional to the amount of change which occurred between those branching events. As the tree you built was estimated using a distance (1 – similarity) measure (i.e. NJ), the proximity of nodes represents their overall degree of similarity.

To display the lengths of the branches of the tree, in the panel on the right hand side of the tree view, select “Substitutions per site” from the dropdown box next to “Display” under “Show Branch Labels”.

On your tree, find “SIV-MAC; Rhesus macaque; M33262” and “SIV-MNE; Pig-tailed macaque; U79412” and look at the length of the branches separating these two taxa. Now find “SIV-RCM; Red-capped mangabey; AF382829” and “SIV-RCM; Red-capped mangabey; AF349680” and look at the length of these branches. The branches separating the SIV-MAC and SIV-MNE sequences are shorter than the branches separating the two SIV-RCM sequences. From this you can conclude that SIV-MAC is more similar to SIV-MNE than the two SIV-RCM sequences are to each other.
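Comparing the branches that separate two tips amounts to summing the branch lengths along the path between them. The Python sketch below does this for a toy tree. The branch lengths are invented for illustration and are not the values from the tutorial tree; the tip labels are shortened, hypothetical names.

```python
# child -> (parent, branch length in substitutions per site); all values are made up
branches = {
    "MAC": ("n1", 0.01),
    "MNE": ("n1", 0.02),
    "RCM1": ("n2", 0.08),
    "RCM2": ("n2", 0.07),
    "n1": ("root", 0.10),
    "n2": ("root", 0.05),
}


def distances_to_ancestors(tip):
    """Map the tip and each of its ancestors to the summed branch length from the tip."""
    distances = {tip: 0.0}
    node, total = tip, 0.0
    while node in branches:
        node_parent, length = branches[node]
        total += length
        distances[node_parent] = total
        node = node_parent
    return distances


def path_length(tip_a, tip_b):
    """Sum of branch lengths on the path joining two tips.

    The path turns around at the most recent common ancestor, which is the
    shared ancestor with the smallest combined distance to both tips.
    """
    from_a = distances_to_ancestors(tip_a)
    from_b = distances_to_ancestors(tip_b)
    shared = set(from_a) & set(from_b)
    return min(from_a[node] + from_b[node] for node in shared)


print(round(path_length("MAC", "MNE"), 3))
print(round(path_length("RCM1", "RCM2"), 3))
```

In this invented example the path between the two macaque sequences is shorter than the path between the two mangabey sequences, which mirrors the comparison made on the real tree.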

If an optimality method (e.g., MP or ML) was used to estimate the tree then the proximity of two nodes reflects the number of evolutionary changes in character states estimated to have occurred between them. If the total branch length from the root of a tree to organism A at one tip is much greater than from the root to organism B at another tip, then you can say that evolution has been faster in the A lineage than in the B lineage for the characters on which the tree was based.

To transform the tree to a cladogram, tick the Transform branches box in the “Formatting” options. In the dropdown box next to Transform, select Cladogram.

Notice how the branch lengths of the tree change and all of the tips of the tree are aligned on the right hand side of the tree view. With this transformation the lengths of the branches are meaningless. If you now look at “SIV-MAC; Rhesus macaque; M33262” and “SIV-MNE; Pig-tailed macaque; U79412” and then look at “SIV-RCM; Red-capped mangabey; AF349680” and “SIV-RCM; Red-capped mangabey; AF382829” you can see that the branches separating SIV-MAC from SIV-MNE are the same length as the branches separating the two SIV-RCM sequences. With the transformed branches you cannot draw any conclusions about how similar the sequences are to each other.

To convert the tree back to a phylogram, untick the option Transform branches. To hide the branch lengths, untick the box next to “Show Branch Labels”.

Displaying support values

In addition to the information conveyed by the topology of the tree and the branch lengths of the tree, further information can also be written on the nodes and/or branches of the tree. The information that is available to display will depend on the tree building method and the options used. Often, support values are displayed on the tree.

Tree building methods produce the tree which best explains the information in the alignment; however, it is unlikely this tree will explain all of the variation in the alignment. Not all of the sites in the alignment will support this tree and not all of the clades in the tree will necessarily be strongly supported by the alignment. For example, with rapid speciation events, there may be insufficient information in the alignment to determine the branching pattern of a group of species, and some of the clades in the tree may have only marginally more support than alternative possible clades.

If you look at the tree you have built it is difficult to tell which clades are strongly supported and which are not. For example, does the clade containing “SIV-RCM; Red-capped mangabey; AF382829” and “SIV-RCM; Red-capped mangabey; AF349680” have the same support from the alignment as the clade containing “SIV-MND; Mandrill; AY159322” and “SIV-MND; Mandrill; AF367411”?

To find out how strongly the alignment supports each of the clades in the tree, we can calculate support values. In the tree building options you selected the “Bootstrap” resampling method. The bootstrap statistic for a clade in the tree is the percentage of times that clade appeared in the set of bootstrap replicate trees. This percentage ranges from 0% (the clade did not appear in any of the bootstrap trees) to 100% (the clade appeared in all of the bootstrap trees). A bootstrap replicate tree is generated by randomly sampling sites, with replacement, from the alignment, to create a new randomised alignment and then building a tree from this sampled alignment. This process is repeated for the specified number of bootstrap replicates (in your case, this was 100).
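The resampling step and the support calculation can be sketched in a few lines of Python. This is a simplified illustration, not the Geneious implementation: the function names, the toy alignment and the list of clades per replicate are all hypothetical, and the tree-building step that would turn each resampled alignment into a tree is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def clade_support(clade, replicate_clade_sets):
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(1 for clade_set in replicate_clade_sets if clade in clade_set)
    return 100.0 * hits / len(replicate_clade_sets)


# Toy alignment (made up); every sequence has the same length
toy_alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTT",
}
replicate = bootstrap_alignment(toy_alignment, random.Random(1))
print(replicate)

# Hypothetical clades recovered from four replicate trees (tree building not shown)
replicate_clades = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
    {frozenset({"B", "C"})},
    {frozenset({"A", "B"})},
]
print(clade_support(frozenset({"A", "B"}), replicate_clades))
```

Because whole columns are sampled, each site keeps its pattern across all sequences; only the number of times each site is represented changes from one replicate to the next.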

To show the bootstrap values on the tree, tick the box next to Show Branch Labels and select Consensus Support (%) from the dropdown box next to “Display”.

The bootstrap value for a clade will appear to the left of the most recent common ancestral node for that clade.

Now that the bootstrap values are displayed on the tree, you can see that there is strong support (100%) for the clade containing the SIV-RCM sequences. However, the clade containing the two mandrill sequences has less support (55%). Note that due to the nature of the bootstrapping process, the support values on your tree may be slightly different.

Sometimes it is useful to collapse nodes that have little bootstrap support so that these do not contribute to the topology of the tree. This can be done in the bootstrapping options when the tree is built by changing the Support threshold value. If this is set to 50%, nodes with bootstrap support of less than 50% will be collapsed into polytomies. The screenshot below shows an example where the nodes with 38% and 36% bootstrap support in (A) are collapsed when the support threshold is set to 50% (B).
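Collapsing a weakly supported node means removing it and attaching its children directly to its parent. The Python sketch below shows the idea on a hypothetical tree in which each internal node is a dictionary holding a support value and a list of children. It is an illustration of the concept under those assumptions, not the procedure Geneious uses internally.

```python
def collapse(node, threshold):
    """Return a copy of the tree with internal nodes below the threshold collapsed.

    Tips are strings; internal nodes are dicts with "support" and "children".
    The node passed in at the top (the root) is never collapsed.
    """
    if isinstance(node, str):
        return node
    new_children = []
    for child in node["children"]:
        child = collapse(child, threshold)
        if isinstance(child, dict) and child["support"] < threshold:
            # Attach the grandchildren directly to this node, forming a polytomy
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    return {"support": node["support"], "children": new_children}


# Hypothetical tree: one node with 38% support sits above a well-supported pair
example_tree = {
    "support": 100,
    "children": [
        {"support": 38, "children": ["A", {"support": 90, "children": ["B", "C"]}]},
        "D",
    ],
}
collapsed = collapse(example_tree, 50)
print(collapsed)
```

After collapsing, A, the B and C pair, and D all descend from the same node, so the tree no longer claims a resolved branching order that the data only weakly support.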

Exercise 3: Molecular Phylogenetics of HIVs and SIVs

Find the HIV-1 sequences on the tree you created.

Question 1: From which African primate species did humans contract HIV-1?

>

Question 2: How many times did this occur? Explain your answer.

>

Question 3: How much bootstrap support is there for your conclusions?

>

Find the HIV-2 sequences on the tree.

Question 4: From which African primate species did humans contract HIV-2?

>

Question 5: How many times does this appear to have happened? Explain your answer.

>

Question 6: How much bootstrap support do you have for your conclusions?

>

Find the rhesus macaque (MAC), stump-tailed macaque (STM) and pig-tailed macaque (MNE) SIV sequences.

Question 7: From which African primate species did these Asian macaque species contract an SIV in captivity?

>

Exercise 4: The Origin of the HIV-1 Pandemic

In trying to understand why the HIV-1 pandemic started, one of the key questions has been, “when did it start?” From early analyses of limited molecular sequence data to more recent analyses of much more abundant data, the following hypotheses have been proposed:

  1. HIV-1 has been circulating in humans for a long time, perhaps thousands of years.
  2. HIV-1 is a new pathogen in humans, having jumped from a primate species in the last 100 years or so.
  3. HIV-1 was introduced into humans inadvertently during a massive polio vaccine trial in central West Africa in the late 1950s, in which chimpanzee tissue was used to develop the vaccines.

Test these hypotheses by calculating when the HIV-1 pandemic strains originated. You can do this because, although the sequences tend not to evolve in a clock-like manner over the entire tree, the sequences of the HIV-1 pandemic isolates do appear to evolve in a clock-like manner.

Using the tree you generated, view the lengths of branches that lead from the HIV-1 Group M sequences (subtypes A-K) to their common ancestor. You will need to use the zoom function under the General tab to view these branches and their lengths clearly. Adjust the font as needed.

Question 8: Calculate the mean sum of branch lengths from each Group M tip to the common ancestor of Group M. (To view the branch lengths on the tree, select “Substitutions per site” from the dropdown box next to “Display” under “Show Branch Labels”.)

>

Question 9: Assuming that the substitution rate of HIV-1 is approximately 10⁻⁵ substitutions per nucleotide site per generation, use the mean sum of branch lengths to calculate how many HIV-1 generations have elapsed since the Group M sequences diverged from a common ancestor.

>

Question 10: An HIV-1 generation lasts about 2 days. This is the time it takes for a virus particle to infect a cell and produce new virus particles ready to infect new cells. From this generation time, calculate how many years ago the Group M sequences diverged from a common ancestor.

>

Question 11: Approximately in which year did the HIV-1 pandemic sequences originate assuming the average sampling year was 1990?

>

Question 12: Which of the above hypotheses about when the pandemic sequences originated does your result support?

>
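The arithmetic in Questions 9 to 11 can be laid out as a short Python sketch. The rate, generation time and sampling year are the values given in the questions; the function names are assumptions, and the branch length passed in at the end is a placeholder chosen only to show the calculation, not a value read from the tree. Substitute your own mean from Question 8.

```python
def generations_since_divergence(mean_branch_sum, rate_per_generation=1e-5):
    """Substitutions per site divided by substitutions per site per generation."""
    return mean_branch_sum / rate_per_generation


def years_since_divergence(mean_branch_sum, rate_per_generation=1e-5, days_per_generation=2.0):
    """Convert elapsed generations to years using the generation time in days."""
    generations = generations_since_divergence(mean_branch_sum, rate_per_generation)
    return generations * days_per_generation / 365.0


def origin_year(mean_branch_sum, sampling_year=1990):
    """Estimated year of the common ancestor, counted back from the sampling year."""
    return sampling_year - years_since_divergence(mean_branch_sum)


# Placeholder value only; replace with the mean you calculated in Question 8
placeholder_mean = 0.02
print(round(generations_since_divergence(placeholder_mean)))
print(round(years_since_divergence(placeholder_mean), 1))
print(round(origin_year(placeholder_mean)))
```

The estimate scales linearly with the mean branch length, so small differences in the value you measure on the tree shift the estimated origin year in proportion.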

You have completed this tutorial. The answers to the above questions can be found here.