The states S1, S2, S3 represent the 3 consecutive nucleotides of a start codon, E1, E21, E22, E32, E33 represent the 3 possible stop codons, B is background and I[123] are “inner” codons.
The emitted symbols are nucleotides A, C, G, T, background and inner codons emit with uniform probabilities.
The emission probabilities of start or end states must be specified (easy to guess).
The transition probabilities not specified on the schema should be easy to guess. The probabilities must be calculated so that the 3 possible stop codons have the same probability.
states =c("B", "S1", "S2", "S3", "I1", "I2", "I3", "E1", "E21", "E22","E31", "E32")nstates =length(states)symbols =c("A", "C", "G", "T")nsym =length(symbols)### Most emission prob. will be 1/4 so fill the entire matrix with 1/4,### then modify the ones that are differentEmat =matrix(1/4, ncol=nsym, nrow=nstates, dimnames=list(states, symbols))### start is ATGEmat["S1",] =c(1, 0, 0, 0)Emat["S2",] =c(0, 0, 0, 1)Emat["S3",] =c(0, 0, 1, 0)### stop is one of TAG, TAA, TGA### first char is TEmat["E1",] =c(0, 0, 0, 1)### second is A or GEmat["E21",] =c(1, 0, 0, 0)Emat["E22",] =c(0, 0, 1, 0)### third is A or G after A, and only A after GEmat["E31",] =c(0, 0, 1, 0)Emat["E32",] =c(1, 0, 0, 0)### Transition matrix: according to the schemaMmat =matrix(0, ncol=nstates, nrow=nstates, dimnames=list(states, states))Mmat["B", "B"] =0.98Mmat["S1", "S2"] =1Mmat["S2", "S3"] =1Mmat["S3", "I1"] =1Mmat["I1", "I2"] =1Mmat["I2", "I3"] =1Mmat["I3", "I1"] =0.9Mmat["E31", "B"] =0.95Mmat["E32", "B"] =0.95### Sum of probabilities going out of any state must be 1Mmat["B", "S1"] =0.02Mmat["I3", "E1"] =0.1Mmat["E31", "S1"] =0.05Mmat["E32", "S1"] =0.05### These were calculated so that each path through the E states has probability 1/3Mmat["E1", "E21"] =2/3Mmat["E1", "E22"] =1/3Mmat["E21", "E31"] =0.5Mmat["E21", "E32"] =0.5Mmat["E22", "E32"] =1
Run the Viterbi algorithm on the segment 1501:1800 of the human gene and display the resulting states
library("aphid")### convert DNA sequence to a list of individual characters,### keep only the positions 1501-1800seq =strsplit(as.character(genes[[1]]), '')[[1]][1501:1800]### we create an artificial "Begin" that goes directly to "B":### add 1 row and 1 column to MmatM2 =cbind(rep(0, nstates+1), rbind(rep(0, nstates), Mmat))colnames(M2)[1] ="Begin"rownames(M2)[1] ="Begin"### transit from "Begin" to "B"M2["Begin", "B"] =1### Emat / Mmat are in units of probability,### the HMM calculations are with log(probability)hmm.orf =structure(list(A=log(M2), E=log(Emat), qe=rep(.25,4)), class="HMM")plot(hmm.orf)