4. Molecular evolution and phylogeny inference
Exercise 1 - Molecular evolution in the flu virus
Note: this exercise was part of a graded assignment for this class in a previous year.
The file HA_sequences.fasta contains a list of nucleotide sequences of the gene coding for hemagluttinin (HA), from influenza viruses sampled between 1968 and 2005. In the fasta format, each sequence comes with a header that contains annotations: here, the header contains the year of sampling.
- Load the sequences and inspect the data. In R, you may use the seqinr::read.fasta function for this, part of the seqinr package. How many sequences are there? What is the length of each sequence?
Hint: You can use the functions getAnnot(), getSequence or getLength() within the seqinr toolbox to extract the header, sequence or length of a sequence respectively.
Calculate the Hamming distance between the first sequence (A/Aichi/2/1968) and each of the other sequences. In R, you may use the DescTools::StrDist function for this, part of the DescTools package. Remark 1: this package requires R version >= 4.2.0. Remark 2: remember that the Hamming distance should be between 0 and 1. Also calculate the Jukes-Cantor distance between the first sequence (A/Aichi/2/1968) and each of the other sequences. Plot both of them versus the sampling year. Comment: what is the trend of these distances? What fraction of the HA gene has changed due to mutations during this 37 year period? How many mutations per site on average does this correspond to?
If you wanted to construct a phylogenetic tree from the sequences considered here, do you think that the UPGMA method would give a reasonable result? Justify your answer. You do not need to construct a tree.
Calculate the Hamming distances between each pair of strains from the same year. Do this for all years, obtaining a list of Hamming distances between strains from the same year. (This calculation takes some time.) Plot the distribution of all these distances in a single histogram (including the data corresponding to all years). Calculate the mean and the maximum value of these distances. Comment: compare to the results from question 2.
Hint: To plot a histogram, one option is to use the hist() function.
- Focusing on Hamming distances for simplicity, estimate how long it would take for sequences to accumulate a number of differences corresponding to the average distance between sequences from the same year.
Exercise 2 - Phylogeny reconstruction for influenza hemagglutinin
In this exercise, we will still consider the influenza virus, and the sequences coding for hemagglutinin, but we will focus on a smaller set of sequences from H3N2 influenza viruses collected in the US from 1993 to 2008. They are in the file “US_sequences_93_08.fasta”. Since the headers are not very informative, they are supplemented by a file of annotations called “US_sequences_93_08_annotations.csv”. We will use the ape: Analyses of Phylogenetics and Evolution package. If necessary, please install it, as well as the adegenet package.
Load the sequences and inspect the data. For the data format to be easy to use with ape, this time you can use read.dna for this. Load the annotations too, and inspect them. You can use read.csv for this.
Calculate all pairwise distances between sequences using the Jukes-Cantor model. For this you can use dist.dna. Next, infer a neighbor joining (NJ) tree from this list of distances, using bionj, and then ladderize. Plot the tree using plot.
For a more informative data visualization, we will now aim at annotating each leaf of the tree with the year that the sequence was collected in. Use data from the annotation file for this, and represent the year visually by a color gradient as well as by a tip label for each leaf.
Hint 1: you can use the tiplabels() function from the ape package to add the year to the tip of the tree. Use the “bg” flag within tiplables() to provide the color for the tip (see also Hint 2).
Hint 2: you can use the num2col() function from the adegenet package to convert a numeric value (year) to a color based on a user-defined color palette (col.pal).
Until now, the tree was unrooted. Since each sequence is annotated with the year it was collected in, we can use one of the oldest sequences to define the root. After inspecting the annotations, root the tree using root. Comment on the tree you obtained.
To assess the quality of the NJ tree you constructed, evaluate the patristic distances (=distances along the tree, based on the branch lengths) using cophenetic, and compare them to the Jukes-Cantor distances between sequences you determined in question 2. For this, you can plot the patristic distances versus the Jukes-Cantor distances between sequences. Comment on the resulting plot.
Evaluate the confidence in each node using the bootstrap approach, by using the function boot.phylo. Comment on the bootstrap values obtained, given that the maximum possible value is 100.
Exercise 3: Plot, analyze and manipulate a gene tree
Here, we will start from a gene tree covering many species that was constructed using a specialized maximum likelihood tree inference software, specifically RaxML. We will continue to use the R package ape to plot, annotate and manipulate this tree.
Load the file RAxML_bipartitions.Homologs using read.tree and plot the tree using plot.phylo
Re-root the tree at the base of all bird genes (find all tree tips corresponding to penguin using grep, then getMRCA and root. Then plot the new tree using plot.phylo
Display the bootstrap values > 80 with nodelabels. Annotate the tree by displaying the Barn_owl, human and Atlantic_salmon genes in 3 different colors. Re-plot the tree showing these annotations.
Based on this tree, how many paralogs of this gene exist in mammals, in birds and in fish? Which rat gene is the ortholog of human_5187?
Use the methods drop.tip, keep.tip and extract.clade to plot the following subtrees:
- All non-fish species (remove all salmon, zebrafish and torafugu)
- The clade containing human_8863 and house_mouse_18628 (a clade contains all descendents of the last common ancestor of these 2 leaves)
- Only the birds (penguins and owls)