Phylogenetic Trees Tutorial

Investigate the evolutionary origins of HIV

Note: To complete the tutorial with the referenced data please download the tutorial and install in Geneious Prime.

In this tutorial, you will use Geneious to investigate the evolutionary origins of human immunodeficiency viruses (HIVs) using molecular phylogenetic tools. You will learn how to build a phylogenetic tree from a sequence alignment and how to view and manipulate the tree to answer questions on the origins of HIV-1.

Introduction: Human and Simian Immunodeficiency Viruses
Exercise 1: Build a Phylogeny of HIVs and SIVs
Exercise 2: Molecular Phylogenetics of HIVs and SIVs
Exercise 3: The Origin of the HIV-1 Pandemic

Introduction: Human and Simian Immunodeficiency Viruses

HIVs, the causes of acquired immune deficiency syndrome (AIDS), are closely related to simian (monkey and ape) immunodeficiency viruses (SIVs). These and other similar viruses are retroviruses. Retroviruses are characterised by their RNA genomes, which, once inside a host cell, are reverse transcribed into DNA and then integrated into the host cell’s genome. The integrated viral genome is known as a provirus. You will be working with proviral DNA sequences.

The origins of HIVs were mysterious when these viruses were first discovered in the early 1980s. There are two types of HIVs. HIV type 1 (HIV-1) is more widespread and causes more severe disease than HIV type 2 (HIV-2). HIV-1 is also far more diverse than HIV-2. HIV-1 is classified into three major groups: M, N, and O. The viruses causing the AIDS pandemic (widespread epidemic) belong to Group M. Group M is subdivided into several subtypes. You will be analysing sequences from HIV-1 Group M Subtypes A, B, C, D, F, G, H, J, K. The HIV-1 viruses infecting people in North America, Europe and Australia are mostly from Group M Subtype B. All groups and subtypes of HIV-1 and HIV-2 are found in Africa.

Both HIV-1 and HIV-2 are closely related to SIVs found in a variety of African primate species. This led researchers to hypothesise early on that HIVs had jumped to humans from one or more African primate species. It was suggested that close contact between humans and monkeys that were kept as pets or hunted for food had allowed the SIVs to jump hosts. In this tutorial you will use molecular phylogenetics to determine the evolutionary relationships of HIVs and SIVs, and so determine from which African primates HIVs originated.

The Sequences

All retroviruses contain three large genes: gag, pol, and env. The gag gene codes for several structural proteins that form the viral particle (virion) capsid and perform other functions. The pol gene codes for several enzymes, including reverse transcriptase. You will be analysing env sequences, which are about 2.5 kilobases (kb) long. The env gene codes for two viral envelope glycoproteins that are positioned on the virion surface and interact with host cell-surface receptors.

You will be analysing an alignment of 62 env sequences of HIV-1, HIV-2, and various SIVs. The alignment has been made for you because aligning this many long sequences can require considerable computation time.

Click the file ‘HIV_Alignment’ to open the provided alignment.

When viewing the alignment, note that there are many large gaps, which is characteristic of an alignment of a rapidly evolving gene in divergent species.

The sequences are labelled in the format: virus type; followed by the common name of the primate species for the SIV sequences, or the group or subtype for HIV-1 and HIV-2 sequences; finally followed by the accession number.
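If you export the sequence names and want to work with them outside Geneious, the three-part label format described above can be split on the semicolons. The following is only a minimal sketch: the function name `parse_label` and the field names are illustrative choices, not part of Geneious, and the example label is the Mona monkey sequence used later in this tutorial.

```python
def parse_label(label):
    """Split a tutorial-style sequence label into its three fields.

    Assumes the label has exactly three semicolon-separated parts:
    virus type; host common name (or HIV group/subtype); accession number.
    """
    virus, descriptor, accession = [field.strip() for field in label.split(";")]
    return {"virus": virus, "descriptor": descriptor, "accession": accession}


record = parse_label("SIV-MON; Mona monkey; AY340701")
print(record["virus"], "|", record["descriptor"], "|", record["accession"])
```

A label with more or fewer than three fields would raise a ValueError here, which is a reasonable signal that the name does not follow the tutorial's convention.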

This alignment contains sequences from various African primate species known to be infected with different SIVs. There are also three non-African species, all from Asia, that have been infected with SIVs in captivity: the pig-tailed macaque, the rhesus macaque and the stump-tailed macaque. The SIVs from all of these primate species are referred to by the three-letter code listed with each species below. For example, the SIV from the sooty mangabey is called SIVSMM and the sequence in the alignment or tree is labelled SIV-SMM.

Mona monkey
Cercopithecus mona mona [denti]
MON [DEN]

de Brazza’s monkey
Cercopithecus neglectus
DEB

Tantalus monkey
Chlorocebus tantalus
TAN

Sykes’ monkey
Cercopithecus albogularis
SYK

Greater spot-nosed monkey
Cercopithecus nictitans
GSN

Green monkey
Chlorocebus sabaeus
SAB

Mustached guenon
Cercopithecus cephus
MUS

Vervet monkey
Chlorocebus pygerythrus
VER

Grivet
Chlorocebus aethiops
GRV

L’Hoest’s monkey
Cercopithecus lhoesti
LST

Sooty mangabey
Cercocebus atys
SMM

Red-capped mangabey
Cercocebus torquatus
RCM

Sun-tailed monkey
Cercopithecus solatus
SUN

Mandrill
Mandrillus sphinx
MND

Drill
Mandrillus leucophaeus
DRL

Pig-tailed macaque
Macaca nemestrina
MNE

Stump-tailed macaque
Macaca arctoides
STM

Rhesus macaque
Macaca mulatta
MAC

Common chimpanzee
Pan troglodytes
CPZ

Exercise 1: Build a Phylogeny of HIVs and SIVs

You will construct a phylogeny using the Neighbour-Joining tree building method and the Tamura-Nei model. Models of evolution describe expected frequencies of each nucleotide and the rate of change between nucleotides. The Tamura-Nei model assumes each base has a different equilibrium frequency and allows transitions and transversions to occur at different rates. It allows the two types of transitions (A ↔ G and C ↔ T) to have different rates. This is useful when analysing HIV sequences because HIV exhibits hyper G-to-A mutation caused by a host enzyme (APOBEC3G). You will use the Neighbour-Joining method because these sequences do not, in general, evolve in a clock-like manner.
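To make the three classes of change that the Tamura-Nei model treats separately more concrete, the sketch below counts them between two aligned sequences. It is an illustration only, not the Geneious implementation: it does not compute the Tamura-Nei distance itself, the function name is made up, and the two short sequences are invented rather than taken from the HIV alignment.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitution_classes(seq_a, seq_b):
    """Count A<->G transitions, C<->T transitions and transversions.

    Sites that are identical, gapped or ambiguous are skipped.
    """
    counts = {"purine_transitions": 0, "pyrimidine_transitions": 0, "transversions": 0}
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        if x in PURINES and y in PURINES:
            counts["purine_transitions"] += 1      # A <-> G
        elif x in PYRIMIDINES and y in PYRIMIDINES:
            counts["pyrimidine_transitions"] += 1  # C <-> T
        else:
            counts["transversions"] += 1           # purine <-> pyrimidine
    return counts


# Invented example sequences (aligned, with one gap).
print(count_substitution_classes("AGCTAG-A", "GGTTCGAA"))
```

A model that gave all three classes the same rate would ignore an excess of G-to-A changes like the one described above, which is why a model with separate transition rates is a sensible choice here.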

Open the provided alignment.

To construct a Neighbour-Joining tree using the Tamura-Nei model, with bootstrapping, click the Tree button and select the Geneious Tree Builder. You should reset any changed parameters to the defaults by pressing the Reset to Defaults button (if this option is greyed out then the default settings are already selected).

For the genetic distance model select Tamura-Nei and for the tree build method select Neighbor-Joining. Set the outgroup to “SIV-MON; Mona monkey; AY340701”. This sequence will be used to root the tree.

To calculate support values for the tree use bootstrapping. To do this, tick the box next to Resample tree and select Bootstrap in the dropdown box next to resampling method. Use 100 bootstrap samples and a support threshold of 0. Do not use more than 100 samples for this example analysis, as the more samples you choose, the longer the analysis will take. If you would rather use the provided tree you can do so by clicking here.

The tree building options should now look similar to this:

Click OK to build the tree.

Once the tree builder completes, the tree document will appear in the document table in Geneious and should open automatically.

Viewing and Manipulating Phylogenetic Trees

A phylogenetic tree is a branching diagram of evolutionary relationships. It contains information about the order of evolutionary divergences within, and hence about the relationships among, a group of organisms. It can also contain information about the amount of evolutionary change which occurred between any two branching events. The lines on the tree are called branches and the intersections of these lines are called nodes. A node represents a branching event in the tree. The branching pattern of a tree is called its topology. The topology shows how organisms are related to one another.

Depending on the size of your screen and the size of the tree, it may not be physically possible to display all of the sequence names on the tree, so Geneious will only display some of the sequence names. To zoom in on the tree, use the Zoom slider under “General” in the panel on the right hand side of the tree view. To expand the distance between the branches of the tree, use the Expansion slider. As the amount of space between the branches increases, more sequence names will be displayed on the tree.

As this tree was created using an alignment in Geneious, the alignment is attached to the tree. Click on the “Alignment View” tab to view the alignment.

The sequences in the alignment are sorted according to the topology of the tree. On the left hand side of the sequence names, you can see the tree topology (this may not be visible if you are working with large trees). Select the “SIV-MON; Mona monkey; AY340701” sequence in the alignment then return to the “Tree View”. This sequence is now selected in the tree as well.

The sequences used to build this alignment and tree have additional meta-data associated with them (this is the data found in the “Properties” field in the “Info” tab in the individual sequence documents). This information can be displayed on the tips of the trees. To display the organism on the tips of the tree, select “Organism” from the box next to “Display” under “Show Tip Labels”.

To display the organism and host organism, hold Ctrl (on Windows) or Cmd (on Macs) and select “Organism” and “Host Organism”. Now the host organism and organism are displayed on the tips of the tree, separated by a comma. To display the sequence names on the tree, select “Names”.

Just as a sentence can be printed using different fonts, or colors of ink, without any change in meaning, so too can trees be represented in different shapes and orientations. The information encoded in the tree remains unchanged, even as the appearance changes. For example, the appearance of the tree can be changed by rotating groups of branches. To rotate the branches, select an internal node in the tree and click the Swap Siblings button at the top of the window. This will rotate the branches in that subtree; however, the degree of relatedness is not altered by rotating branches in a tree. Simply having two names close together in a tree does not imply any close relationship.

Try this with the tree you have created. Select the node in the tree containing the Grivet monkey and the four Vervet monkeys and click the Swap Siblings button.

The order of these samples will change in the tree, but the relationship between the sample from the Grivet monkey and those from the four Vervet monkeys has not changed.
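The point that swapping siblings changes only the drawing, not the relationships, can be checked on a toy tree. The sketch below is not how Geneious stores trees: it assumes a tree written as nested Python tuples, and the tip names and topology are invented for illustration.

```python
def clades(tree):
    """Return every clade in a nested-tuple tree as a frozenset of tip names."""
    found = set()

    def tips(node):
        if isinstance(node, str):          # a tip
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)                 # record the clade below this internal node
        return members

    tips(tree)
    return found


def swap_siblings(node):
    """Reverse the order of the children at this node only."""
    return tuple(reversed(node))


# Toy tree with invented tip names: ((Grivet, (Vervet1, Vervet2)), Outgroup)
toy = (("Grivet", ("Vervet1", "Vervet2")), "Outgroup")
rotated = (swap_siblings(toy[0]), toy[1])

print(rotated)                          # the drawing order has changed
print(clades(toy) == clades(rotated))   # the set of clades has not
```

Because a clade is defined by which tips it contains, not by the order in which they are drawn, the two trees carry identical information.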

Rooted Trees

Trees may be unrooted or rooted. To view the HIV tree as an unrooted tree, click one of the unrooted views under the “General” options in the panel on the right hand side of the tree view.

Unrooted trees do not tell us much about evolutionary relationships. We cannot tell which node is the ancestor and which are the descendent nodes on the tree. To establish ancestor-descendent relationships we need to identify a suitable outgroup and then root the tree on the branch separating the outgroup from the remainder of the tree (the ingroup). We can specify the root before building the tree to produce a rooted tree, or we can specify the root after the tree is built to change an unrooted tree to a rooted tree.

When you built the tree of HIV and SIV sequences you specified an outgroup (“SIV-MON; Mona monkey; AY340701”) so Geneious has produced a rooted tree. To view the tree as a rooted tree, click the rooted view under the “General” options in the panel on the right hand side of the tree view.

Rooted phylogenetic trees may be oriented horizontally, as above, or vertically. Here the time axis is implicit, running from left to right. The node at the left end of the tree is the root node, which represents the oldest point on the tree. As we move from the root node, we can identify nodes which are ancestral to their descendent clades. Working in from the tips of the tree enables us to identify close and distant relatives. The degree of relatedness of any two organisms is given by how far back on a rooted tree you must go to find their common ancestor. If, in tracing back to the common ancestor of A and B, you pass the common ancestor of A and C, then you can say that A and C are more closely related than A and B.

On a rooted tree, each node and all of its descendent nodes form a clade. This is what we would commonly refer to as a “branch” on a real tree – the physical branch and all the little branches and leaves attached to it. Because an unrooted tree lacks the time axis described above, it is inappropriate to discuss clades in that context.
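The ideas of tracing back to a common ancestor and of a clade can be sketched in a few lines. This is a toy illustration under stated assumptions: the tree is stored as a dictionary mapping each node to its parent, and the node names (A, B, C, D, n1, n2) are invented to match the A, B and C example above.

```python
# Toy rooted tree: root -> (n1, D); n1 -> (n2, B); n2 -> (A, C).
PARENT = {
    "root": None,
    "n1": "root", "D": "root",
    "n2": "n1", "B": "n1",
    "A": "n2", "C": "n2",
}


def ancestors(node):
    """List the node followed by its ancestors, ending at the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def mrca(x, y):
    """Most recent common ancestor: the first ancestor of x shared with y."""
    seen = set(ancestors(y))
    return next(node for node in ancestors(x) if node in seen)


def clade(node):
    """All tips that descend from the given node."""
    internal = set(PARENT.values())
    return sorted(t for t in PARENT if t not in internal and node in ancestors(t))


print(mrca("A", "C"), mrca("A", "B"))   # n2 n1
print(clade("n1"))                      # ['A', 'B', 'C']
```

Tracing back from A to the common ancestor of A and B (n1) passes through the common ancestor of A and C (n2), so in this toy tree A and C are more closely related than A and B.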

Phylograms and cladograms

The lengths of the branches of a tree may be arbitrary (e.g. cladogram) or can represent the amount of evolutionary change (phylogram).

In a phylogram, the lengths of the branches are proportional to the amount of change which occurred between those branching events. As the tree you built was estimated using a distance (1 – similarity) measure (i.e. NJ), the proximity of nodes represents their overall degree of similarity.

To display the lengths of the branches of the tree, in the panel on the right hand side of the tree view, select “Substitutions per site” from the drop down box next to “Display” under “Show Branch Labels”.

On your tree, find “SIV-MAC; Rhesus macaque; M33262” and “SIV-MNE; Pig-tailed macaque; U79412” and look at the length of the branches separating these two taxa. Now find “SIV-MND; Mandrill; AF328295” and “SIV-MND; Mandrill; AF367411” and look at the length of these branches. The length of the branches separating “SIV-MAC; Rhesus macaque; M33262” from “SIV-MNE; Pig-tailed macaque; U79412” is shorter than the length of the branches separating “SIV-MND; Mandrill; AF328295” and “SIV-MND; Mandrill; AF367411”. From this you can conclude that “SIV-MAC; Rhesus macaque; M33262” is more similar to “SIV-MNE; Pig-tailed macaque; U79412”, than “SIV-MND; Mandrill; AF328295” is to “SIV-MND; Mandrill; AF367411”.
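Comparing two pairs of taxa in this way amounts to summing the branch lengths on the path that joins each pair. The sketch below shows that sum on a toy tree. The tip names and branch lengths are invented placeholders, not values from the HIV tree, and the dictionary layout is just one convenient way to write a small tree down.

```python
# Toy tree: node -> (parent, length of the branch leading to that node).
BRANCHES = {
    "root": (None, 0.0),
    "x": ("root", 0.05),
    "y": ("root", 0.02),
    "tip1": ("x", 0.01),
    "tip2": ("x", 0.02),
    "tip3": ("y", 0.10),
    "tip4": ("y", 0.15),
}


def path_to_root(node):
    """List the node followed by its ancestors, ending at the root."""
    path = []
    while node is not None:
        path.append(node)
        node = BRANCHES[node][0]
    return path


def path_length(a, b):
    """Sum the branch lengths on the path joining tips a and b."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    shared = set(path_a) & set(path_b)
    # Branches above the common ancestor are shared, so they are left out.
    unshared = [node for node in path_a + path_b if node not in shared]
    return sum(BRANCHES[node][1] for node in unshared)


print(round(path_length("tip1", "tip2"), 3))   # short path: similar sequences
print(round(path_length("tip3", "tip4"), 3))   # longer path: less similar
```

In this toy tree the path between tip1 and tip2 is much shorter than the path between tip3 and tip4, which is the same kind of comparison you are making by eye between the macaque pair and the mandrill pair.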

If an optimality method (e.g., MP or ML) was used to estimate the tree then the proximity of two nodes reflects the number of evolutionary changes in character states estimated to have occurred between them. If the total branch length from the root of a tree to organism A at one tip is much greater than from the root to organism B at another tip, then you can say that evolution has been faster in the A lineage than in the B lineage for the characters on which the tree was based.

To transform the tree to a cladogram, tick the Transform branches box in the “Formatting” options. In the dropdown box next to Transform select Cladogram.

Notice how the branch lengths of the tree change and all of the tips of the tree are aligned on the right hand side of the tree view. With this transformation the lengths of the branches are meaningless. If you now look at “SIV-MAC; Rhesus macaque; M33262” and “SIV-MNE; Pig-tailed macaque; U79412” and then look at “SIV-MND; Mandrill; AF328295” and “SIV-MND; Mandrill; AF367411” you can see that the branches separating “SIV-MAC; Rhesus macaque; M33262” from “SIV-MNE; Pig-tailed macaque; U79412” are the same length as the branches separating “SIV-MND; Mandrill; AF328295” from “SIV-MND; Mandrill; AF367411”. With the transformed branches you cannot draw any conclusions about how similar “SIV-MAC; Rhesus macaque; M33262” is to “SIV-MNE; Pig-tailed macaque; U79412” in comparison to how similar “SIV-MND; Mandrill; AF328295” is to “SIV-MND; Mandrill; AF367411”.

To convert the tree back to a phylogram, untick the option Transform branches. To hide the branch lengths, untick the box next to “Show Branch Labels”.

Displaying support values

In addition to the information conveyed by the topology of the tree and the branch lengths of the tree, further information can also be written on the nodes and/or branches of the tree. The information that is available to display will depend on the tree building method and the options used. Often, support values are displayed on the tree.

Tree building methods produce the tree which best explains the information in the alignment; however, it is unlikely this tree will explain all of the variation in the alignment. Not all of the sites in the alignment will support this tree and not all of the clades in the tree will necessarily be strongly supported by the alignment. For example, with rapid speciation events, there may be insufficient information in the alignment to determine the branching pattern of a group of species, and some of the clades in the tree may have only marginally more support than alternative possible clades.

If you look at the tree you have built it is difficult to tell which clades are strongly supported and which are not. For example, does the clade containing “SIV-GSN; Greater spot-nosed monkey; AF468659” and “SIV-GSN; Greater spot-nosed monkey; AF468658” have the same support from the alignment as the clade containing “SIV-MND; Mandrill; AF328295” and “SIV-MND; Mandrill; AF367411”?

To find out how strongly the alignment supports each of the clades in the tree, we can calculate support values. In the tree building options you selected the “Bootstrap” resampling method. The bootstrap statistic for a clade in the tree is the percentage of times that clade appeared in the set of bootstrap replicate trees. This percentage ranges from 0% (the clade did not appear in any of the bootstrap trees) to 100% (the clade appeared in all of the bootstrap trees). A bootstrap replicate tree is generated by randomly sampling sites, with replacement, from the alignment, to create a new randomised alignment and then building a tree from this sampled alignment. This process is repeated for the specified number of bootstrap replicates (in your case, this was 100).
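The two halves of the bootstrap procedure, resampling sites and counting how often a clade reappears, can be sketched as follows. This is a simplified illustration and not the Geneious Tree Builder: the tree building step is omitted, the alignment is a tiny invented example, and the replicate clade sets are written by hand to stand in for trees built from resampled alignments.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every sequence, so sites stay aligned.
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def support_percent(clade, replicate_clade_sets):
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(1 for clade_set in replicate_clade_sets if clade in clade_set)
    return 100.0 * hits / len(replicate_clade_sets)


# Invented toy alignment.
toy = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "ATGTTC"}
replicate = bootstrap_alignment(toy, random.Random(1))
print(replicate)

# Hand-written stand-ins for the clades found in four replicate trees.
replicate_clades = [
    {frozenset({"s2", "s3"})},
    {frozenset({"s2", "s3"})},
    {frozenset({"s1", "s2"})},
    {frozenset({"s2", "s3"})},
]
print(support_percent(frozenset({"s2", "s3"}), replicate_clades))   # 75.0
```

In a real analysis each resampled alignment would be passed to the same tree building method used for the original tree, and the clades of each resulting tree would take the place of the hand-written sets.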

To show the bootstrap values on the tree, tick the box next to Show Branch Labels and select Consensus Support (%) from the dropdown box next to “Display”.

The bootstrap value for a clade will appear to the left of the most recent common ancestral node for that clade.

Now that the bootstrap values are displayed on the tree, you can see that there is strong support for the clade containing “SIV-GSN; Greater spot-nosed monkey; AF468659” and “SIV-GSN; Greater spot-nosed monkey; AF468658”, but not for the clade containing “SIV-MND; Mandrill; AF328295” and “SIV-MND; Mandrill; AF367411”.

Exercise 2: Molecular Phylogenetics of HIVs and SIVs

Find the HIV-1 sequences.

Question 1: From which African primate species did humans contract HIV-1?

Question 2: How many times did this occur? Explain your answer.

Question 3: How much bootstrap support is there for your conclusions?

Find the HIV-2 sequences.

Question 4: From which African primate species did humans contract HIV-2?

Question 5: How many times does this appear to have happened? Explain your answer.

Question 6: How much bootstrap support do you have for your conclusions?

Find the rhesus macaque (MAC), stump-tailed macaque (STM) and pig-tailed macaque (MNE) SIV sequences.

Question 7: From which African primate species did these Asian macaque species contract an SIV in captivity?

Exercise 3: The Origin of the HIV-1 Pandemic

In trying to understand why the HIV-1 pandemic started, one of the key questions has been, “when did it start?” From early analyses of limited molecular sequence data to more recent analyses of much more abundant data, the following hypotheses have been proposed:

HIV-1 has been circulating in humans for a long time, perhaps thousands of years.
HIV-1 is a new pathogen in humans, having jumped from a primate species in the last 100 years or so.
HIV-1 was introduced into humans inadvertently during a massive polio vaccine trial in central West Africa in the late 1950s, in which chimpanzee tissue was used to develop the vaccines.

Test these hypotheses by calculating when the HIV-1 pandemic strains originated. You can do this because, although the sequences tend not to evolve in a clock-like manner over the entire tree, the sequences of the HIV-1 pandemic isolates do appear to evolve in a clock-like manner.

Using the tree you generated, view the lengths of branches that lead from the HIV-1 Group M sequences (subtypes A-K) to their common ancestor. You will need to use the zoom function under the General tab to view these branches and their lengths clearly. Adjust the font as needed.

Question 8: Calculate the mean sum from the tip of a branch to the common ancestor of Group M. (To view the branch lengths on the tree select “Substitutions per site” from the dropdown box next to “Display” under “Show Branch Labels”)

Question 9: Assuming that the substitution rate of HIV-1 is approximately 10⁻⁵ substitutions per nucleotide site per generation, use the mean sum of branch lengths to calculate how many HIV-1 generations have elapsed since the Group M sequences diverged from a common ancestor.

Question 10: An HIV-1 generation lasts about 2 days. This is the time it takes for a virus particle to infect a cell and produce new virus particles ready to infect new cells. From this generation time, calculate how many years ago the Group M sequences diverged from a common ancestor.

Question 11: Approximately in which year did the HIV-1 pandemic sequences originate assuming the average sampling year was 1990?

Question 12: Which of the above hypotheses about when the pandemic sequences originated does your result support?
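The arithmetic in Questions 8 to 11 can be laid out as a short calculation. The sketch below uses the rate, generation time and sampling year given in the questions, but the branch lengths are invented placeholders: substitute the values you read from your own tree. The function names are illustrative only.

```python
def mean_root_to_tip(paths):
    """Mean of the summed branch lengths from each tip to the common ancestor."""
    return sum(sum(path) for path in paths) / len(paths)


def divergence_estimate(mean_length, rate_per_generation=1e-5,
                        days_per_generation=2.0, sampling_year=1990):
    """Convert a mean branch length into generations, years and a calendar year."""
    generations = mean_length / rate_per_generation       # Question 9
    years = generations * days_per_generation / 365.0     # Question 10
    return generations, years, sampling_year - years      # Question 11


# Placeholder branch lengths (substitutions per site), NOT read from the HIV tree.
# Each inner list holds the branches on the path from one tip to the ancestor.
paths = [[0.03, 0.05], [0.02, 0.06], [0.04, 0.04]]

mean_length = mean_root_to_tip(paths)                      # Question 8
generations, years, year = divergence_estimate(mean_length)
print(round(mean_length, 3), round(generations), round(years, 1), round(year))
```

With these placeholder values the result has no biological meaning. The calculation also assumes 365 days per year and the clock-like evolution described above.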

You have completed this tutorial. The answers to the above questions can be found here.