3. Hidden Markov Models

Author

EPFL - SV - BIO-463

Published

March 4, 2025

Exercise 1: Translate genes, find longest open reading frame

Load the file GeneSequences.fa

library("Biostrings")
genes = readDNAStringSet("GeneSequences.fa")

Translate the sequences (in 3 frames: only the forward strand)

f1 = genes
f2 = ### second frame of "genes"
f3 = ### third frame
frames = translate(c(f1, f2, f3), no.init.codon=T, if.fuzzy.codon="solve")

Find the longest ORF in each sequence: use matchPattern for residues “*” and “M”

### will store the longest orf for each sequence in this variable
longest.orf = AAStringSet(rep('', length(genes)))
names(longest.orf) = names(genes)
### loop through every frame
for (nf in 1:length(frames)) {
    frm = frames[[nf]]
    nme = names(frames)[[nf]]
    #### search "*" in seqn and loop through results
    stops = start(matchPattern("*", frm))
    #### then search first "M" between last and current "*"
    #### with n0 = position of M, n1 = position of *
    #### if seqn[n0:n1] is longer than longest.orf[nme], then replace it
}

Save it as a fasta file named orf.fa

writeXStringSet(...)

Exercise 2: Construct an HMM to find ORFs

Implement an HMM according to the schema below

The states S1, S2, S3 represent the 3 consecutive nucleotides of a start codon, E1, E21, E22, E32, E33 represent the 3 possible stop codons, B is background and I[123] are “inner” codons.
The emitted symbols are nucleotides A, C, G, T, background and inner codons emit with uniform probabilities.
The emission probabilities of start or end states must be specified (easy to guess).
The transition probabilities not specified on the schema should be easy to guess. The probabilities must be calculated so that the 3 possible stop codons have the same probability.

Complete the code below by filling in all matrix elements:

states = c("B", "S1", "S2", "S3", "I1", "I2", "I3", "E1", "E21", "E22","E31", "E32")
nstates = length(states)
symbols = c("A", "C", "G", "T")
nsym = length(symbols)

Emat = matrix(0, ncol=nsym, nrow=nstates, dimnames=list(states, symbols))
Mmat = matrix(0, ncol=nstates, nrow=nstates, dimnames=list(states, states))

Create the corresponding HMM object (see code below)

library("aphid")

### convert DNA sequence to a list of individual characters,
### keep only the positions 1501-1800
seq = strsplit(as.character(genes[[1]]), '')[[1]][1501:1800]

### we create an artificial "Begin" that goes directly to "B":
### add 1 row and 1 column to Mmat
M2 = cbind(rep(0, nstates+1), rbind(rep(0, nstates), Mmat))
colnames(M2)[1] = "Begin"
rownames(M2)[1] = "Begin"
### transit from "Begin" to "B"
M2["Begin", "B"] = 1

### Emat / Mmat are in units of probability,
### the HMM calculations are with log(probability)

hmm.orf = structure(list(A=log(M2), E=log(Emat), qe=rep(.25,4)), class="HMM")

Plot the HMM schema (see plot.HMM)
Run the Viterbi algorithm on the segment 1501:1800 of the human gene and display the resulting states

### plot...

### Viterbi algo
hmm.vtrb = Viterbi(....)

### for a simple visual display: concatenate all nucleotides into one string
### and show the 1st letter of each state name aligned below
vis.gene = as.character(genes[[1]][1501:1800])
state.letter = substr(states,1,1)
vis.hmmpath = paste(state.letter[hmm.vtrb$path+1], collapse='')
c(vis.gene, vis.hmmpath)