load("data/Ex2.Results.Rdata")
For this part, you only care about the Control condition. The SD condition is filtered out.
Using boxplots (base R or ggplot2), show Spcs1 gene expression. The boxplots must be separated by Strain and Time.
Assuming this gene expression follows a normal distribution, you want to write a statistical model.
You write two linear models to fit Spcs1 expression:
~ Strain * Time_as_Factor
~ Time_as_Factor + Strain:Time_as_Factor
What is the difference between these two models? What do the coefficients represent?
First, we filter our data to keep only the control condition.
idx<-metadata$Condition == "Ctr"
metadata_ctr<-metadata[idx,]
expression_ctr<-expression[,idx]
# We refactor Strain, to remove "d2" and "NA" levels
metadata_ctr$Strain<-factor(metadata_ctr$Strain)
We use boxplot to plot gene expression for each Strain and Time
boxplot(expression_ctr["Spcs1",]~metadata_ctr$Time + factor(metadata_ctr$Strain),xlab="Time.Strain",ylab="Spcs1 expression")
With ggplot2, the x value needs to be a factor, and the boxes are filled by Strain.
# ggplot2 must be loaded before calling ggplot()
library(ggplot2)
MyData<-cbind.data.frame(metadata_ctr,Expression=expression_ctr["Spcs1",])
bp<-ggplot(aes(x=factor(Time),y=Expression,fill=Strain),data=MyData) + geom_boxplot() +
scale_x_discrete(breaks=c(0,6,12,18),labels=c(0,6,12,18))
bp
Or you can use the facet_grid() function to separate the panels by Strain:
MyData<-cbind.data.frame(metadata_ctr,Expression=expression_ctr["Spcs1",])
bp<-ggplot(aes(Time,Expression,group=Time,fill=Strain),data=MyData) + geom_boxplot()
bp + facet_grid(. ~ Strain) + scale_x_continuous(breaks=c(0,6,12,18),labels=c(0,6,12,18))
If we use Time as a numerical value, we assume that gene expression increases or decreases linearly with time (linear regression). This is often not the case over a 24h cycle, where expression tends to be cyclic. Converting Time into a factor lets the model fit any pattern of expression, with one mean per time point.
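The difference between numeric and factor time can be illustrated with a minimal sketch on simulated data. The values below are hypothetical (a sine wave plus noise, not the Spcs1 measurements), and the names `time`, `expr`, `fit_numeric` and `fit_factor` are introduced only for this illustration.

```r
# Hypothetical simulated data: a cyclic pattern sampled at 0, 6, 12 and 18 h
set.seed(1)
time <- rep(c(0, 6, 12, 18), each = 3)
expr <- 11 + 0.1 * sin(2 * pi * time / 24) + rnorm(length(time), sd = 0.02)

# Time as a number: a single slope, so the fit is forced onto a straight line
fit_numeric <- lm(expr ~ time)
# Time as a factor: one mean per time point, so any pattern can be fitted
fit_factor <- lm(expr ~ factor(time))

# Residual sums of squares: the factor model cannot fit worse than the line
rss_numeric <- sum(residuals(fit_numeric)^2)
rss_factor <- sum(residuals(fit_factor)^2)
c(numeric = rss_numeric, factor = rss_factor)

# Number of estimated coefficients: 2 for numeric time, 4 for factor time
length(coef(fit_numeric))
length(coef(fit_factor))
```

The price of the extra flexibility is two additional coefficients, which matters when there are only a few replicates per time point.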
We add a new variable as Time_as_Factor
metadata_ctr$Time_as_Factor<-factor(metadata_ctr$Time)
We fit two models, model1 and model2:
model1<-lm(expression_ctr["Spcs1",]~ Time_as_Factor * Strain,data=metadata_ctr)
model2<-lm(expression_ctr["Spcs1",]~ Time_as_Factor + Strain:Time_as_Factor,data=metadata_ctr)
See the summary of these two models:
summary(model1)
##
## Call:
## lm(formula = expression_ctr["Spcs1", ] ~ Time_as_Factor * Strain,
## data = metadata_ctr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.063255 -0.024878 0.003387 0.020969 0.062507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.96468 0.02089 524.854 < 2e-16 ***
## Time_as_Factor6 -0.10136 0.02954 -3.431 0.00371 **
## Time_as_Factor12 0.03086 0.03303 0.934 0.36491
## Time_as_Factor18 0.02176 0.02954 0.737 0.47270
## StrainD2 0.07849 0.02954 2.657 0.01794 *
## Time_as_Factor6:StrainD2 0.10413 0.04178 2.492 0.02488 *
## Time_as_Factor12:StrainD2 -0.07169 0.04432 -1.618 0.12655
## Time_as_Factor18:StrainD2 -0.12680 0.04178 -3.035 0.00836 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03618 on 15 degrees of freedom
## Multiple R-squared: 0.7901, Adjusted R-squared: 0.6921
## F-statistic: 8.066 on 7 and 15 DF, p-value: 0.0003841
summary(model2)
##
## Call:
## lm(formula = expression_ctr["Spcs1", ] ~ Time_as_Factor + Strain:Time_as_Factor,
## data = metadata_ctr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.063255 -0.024878 0.003387 0.020969 0.062507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.964676 0.020891 524.854 < 2e-16 ***
## Time_as_Factor6 -0.101365 0.029544 -3.431 0.00371 **
## Time_as_Factor12 0.030864 0.033031 0.934 0.36491
## Time_as_Factor18 0.021764 0.029544 0.737 0.47270
## Time_as_Factor0:StrainD2 0.078495 0.029544 2.657 0.01794 *
## Time_as_Factor6:StrainD2 0.182629 0.029544 6.182 1.76e-05 ***
## Time_as_Factor12:StrainD2 0.006802 0.033031 0.206 0.83961
## Time_as_Factor18:StrainD2 -0.048310 0.029544 -1.635 0.12282
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03618 on 15 degrees of freedom
## Multiple R-squared: 0.7901, Adjusted R-squared: 0.6921
## F-statistic: 8.066 on 7 and 15 DF, p-value: 0.0003841
The fitted values of these two models are identical.
fitted(model1) == fitted(model2)
## 24 25 26 27 28 29 31 32 33 34 35 48 49 50 51 52
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 53 54 55 56 57 58 59
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE
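Comparing floating-point numbers with `==` can return FALSE for values that differ only by rounding error, so a tolerance-based comparison is often safer. The following is a minimal sketch on a hypothetical toy design (the data frame `toy` and the models `m1` and `m2` are assumptions made for this example, not the exercise data).

```r
# Hypothetical toy design: 2 strains x 4 time points x 3 replicates
set.seed(42)
toy <- expand.grid(rep = 1:3,
                   Time_as_Factor = factor(c(0, 6, 12, 18)),
                   Strain = factor(c("B6", "D2")))
toy$y <- rnorm(nrow(toy), mean = 11, sd = 0.05)

m1 <- lm(y ~ Time_as_Factor * Strain, data = toy)
m2 <- lm(y ~ Time_as_Factor + Strain:Time_as_Factor, data = toy)

# all.equal() compares within a small numerical tolerance;
# isTRUE() turns its result into a single TRUE or FALSE
isTRUE(all.equal(fitted(m1), fitted(m2)))

# Both parameterisations estimate the same number of coefficients
length(coef(m1)) == length(coef(m2))
```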
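We see that the first 5 coefficient estimates are identical (the fifth is named StrainD2 in model1 and Time_as_Factor0:StrainD2 in model2); only the last 3 coefficients change.
To model the expression of strain ‘D2’ at times 6, 12 and 18 (that is, excluding time 0), the first model uses the following coefficients, numbered from \(\beta_0\) (the intercept) in the order of the summary:
model1 (D2-Time 6): \(expression = \beta_0 + \beta_1 + \beta_4 + \beta_5\)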
sum(coef(model1)[c(1,2,5,6)])
## [1] 11.04594
mean(expression_ctr["Spcs1", metadata_ctr$Strain == "D2" & metadata_ctr$Time_as_Factor == "6"])
## [1] 11.04594
while model2 uses: \(expression = \beta_0 + \beta_1 + \beta_5\)
sum(coef(model2)[c(1,2,6)])
## [1] 11.04594
In model1, the last 3 coefficients represent the difference in expression between D2 and B6 at times 6, 12 and 18, relative to the difference at time 0.
In model2, the last 3 coefficients represent the difference in expression between D2 and B6 at times 6, 12 and 18 directly, independently of the difference at time 0.
In model1, the \(\beta_4\) coefficient (StrainD2) is included in the D2 expression at every time point.
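The link between the two parameterisations can be read from the summaries: at time 6, model1 gives 0.07849 + 0.10413 = 0.18262, which matches the model2 estimate of 0.182629. The sketch below checks the same relationship on hypothetical toy data (`toy`, `b1`, `b2` and `grp` are names introduced for this example only).

```r
# Hypothetical toy data (not the Spcs1 values): 2 strains x 4 times x 3 replicates
set.seed(7)
toy <- expand.grid(rep = 1:3,
                   Time_as_Factor = factor(c(0, 6, 12, 18)),
                   Strain = factor(c("B6", "D2")))
toy$y <- 11 + 0.1 * (toy$Strain == "D2") + rnorm(nrow(toy), sd = 0.05)

b1 <- coef(lm(y ~ Time_as_Factor * Strain, data = toy))
b2 <- coef(lm(y ~ Time_as_Factor + Strain:Time_as_Factor, data = toy))

# D2-vs-B6 difference at each time = model1 StrainD2 + model1 interaction
# (the interaction is 0 at the reference time 0)
from_model1 <- b1["StrainD2"] + c(0, b1[6:8])
all.equal(unname(from_model1), unname(b2[5:8]))

# The same differences computed directly from the group means
grp <- tapply(toy$y, list(toy$Time_as_Factor, toy$Strain), mean)
all.equal(unname(grp[, "D2"] - grp[, "B6"]), unname(b2[5:8]))
```

Which model to prefer depends on the question: model1 tests whether the strain difference changes over time, while model2 tests whether the strains differ at each individual time point.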
See the model matrices:
model.matrix(model1)
## (Intercept) Time_as_Factor6 Time_as_Factor12 Time_as_Factor18 StrainD2
## 24 1 0 0 0 0
## 25 1 0 0 0 0
## 26 1 0 0 0 0
## 27 1 1 0 0 0
## 28 1 1 0 0 0
## 29 1 1 0 0 0
## 31 1 0 1 0 0
## 32 1 0 1 0 0
## 33 1 0 0 1 0
## 34 1 0 0 1 0
## 35 1 0 0 1 0
## 48 1 0 0 0 1
## 49 1 0 0 0 1
## 50 1 0 0 0 1
## 51 1 1 0 0 1
## 52 1 1 0 0 1
## 53 1 1 0 0 1
## 54 1 0 1 0 1
## 55 1 0 1 0 1
## 56 1 0 1 0 1
## 57 1 0 0 1 1
## 58 1 0 0 1 1
## 59 1 0 0 1 1
## Time_as_Factor6:StrainD2 Time_as_Factor12:StrainD2 Time_as_Factor18:StrainD2
## 24 0 0 0
## 25 0 0 0
## 26 0 0 0
## 27 0 0 0
## 28 0 0 0
## 29 0 0 0
## 31 0 0 0
## 32 0 0 0
## 33 0 0 0
## 34 0 0 0
## 35 0 0 0
## 48 0 0 0
## 49 0 0 0
## 50 0 0 0
## 51 1 0 0
## 52 1 0 0
## 53 1 0 0
## 54 0 1 0
## 55 0 1 0
## 56 0 1 0
## 57 0 0 1
## 58 0 0 1
## 59 0 0 1
## attr(,"assign")
## [1] 0 1 1 1 2 3 3 3
## attr(,"contrasts")
## attr(,"contrasts")$Time_as_Factor
## [1] "contr.treatment"
##
## attr(,"contrasts")$Strain
## [1] "contr.treatment"
model.matrix(model2)
## (Intercept) Time_as_Factor6 Time_as_Factor12 Time_as_Factor18
## 24 1 0 0 0
## 25 1 0 0 0
## 26 1 0 0 0
## 27 1 1 0 0
## 28 1 1 0 0
## 29 1 1 0 0
## 31 1 0 1 0
## 32 1 0 1 0
## 33 1 0 0 1
## 34 1 0 0 1
## 35 1 0 0 1
## 48 1 0 0 0
## 49 1 0 0 0
## 50 1 0 0 0
## 51 1 1 0 0
## 52 1 1 0 0
## 53 1 1 0 0
## 54 1 0 1 0
## 55 1 0 1 0
## 56 1 0 1 0
## 57 1 0 0 1
## 58 1 0 0 1
## 59 1 0 0 1
## Time_as_Factor0:StrainD2 Time_as_Factor6:StrainD2 Time_as_Factor12:StrainD2
## 24 0 0 0
## 25 0 0 0
## 26 0 0 0
## 27 0 0 0
## 28 0 0 0
## 29 0 0 0
## 31 0 0 0
## 32 0 0 0
## 33 0 0 0
## 34 0 0 0
## 35 0 0 0
## 48 1 0 0
## 49 1 0 0
## 50 1 0 0
## 51 0 1 0
## 52 0 1 0
## 53 0 1 0
## 54 0 0 1
## 55 0 0 1
## 56 0 0 1
## 57 0 0 0
## 58 0 0 0
## 59 0 0 0
## Time_as_Factor18:StrainD2
## 24 0
## 25 0
## 26 0
## 27 0
## 28 0
## 29 0
## 31 0
## 32 0
## 33 0
## 34 0
## 35 0
## 48 0
## 49 0
## 50 0
## 51 0
## 52 0
## 53 0
## 54 0
## 55 0
## 56 0
## 57 1
## 58 1
## 59 1
## attr(,"assign")
## [1] 0 1 1 1 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$Time_as_Factor
## [1] "contr.treatment"
##
## attr(,"contrasts")$Strain
## [1] "contr.treatment"
Before using heatmap.2, you want to scale the data:
expression_scaled<-t(apply(expression,1,scale))
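As a quick sanity check of the row scaling, the sketch below uses a small hypothetical matrix `x` standing in for `expression` (genes in rows, samples in columns). It also shows that `apply()` drops the sample names, which may matter if the columns are labelled later.

```r
# Hypothetical small matrix: 4 genes (rows) x 5 samples (columns)
set.seed(3)
x <- matrix(rnorm(20, mean = 10), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("s", 1:5)))

scaled_apply <- t(apply(x, 1, scale))
# Same values via scale() on the transposed matrix, which keeps column names
scaled_t <- t(scale(t(x)))
all.equal(scaled_apply, scaled_t, check.attributes = FALSE)

# Each row now has mean 0 and standard deviation 1
round(rowMeans(scaled_apply), 10)
apply(scaled_apply, 1, sd)

# apply() dropped the sample names; restore them if they are needed
colnames(scaled_apply)
colnames(scaled_apply) <- colnames(x)
```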
You can then plot your data:
library(gplots)
heatmap.2(expression_scaled,col=redgreen(75),trace="none",Colv = T,
ColSideColors = c("red","blue")[factor(metadata$Strain)] )
If you use scale="row" instead, heatmap.2 runs hclust on the non-scaled data and only scales the values for display, so the clustering can differ from the one above:
heatmap.2(expression,col=redgreen(75),trace="none",Colv = T,scale="row",
ColSideColors = c("red","blue")[factor(metadata$Strain)] )
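To see why the order of scaling and clustering matters, here is a minimal sketch using `dist()` and `hclust()`, the default distance and clustering functions of heatmap.2. The 4-gene matrix `x` is hypothetical and built so that expression level and expression pattern disagree.

```r
# Hypothetical matrix: A and B share a pattern, A and C share a level
x <- rbind(A = c(1, 2, 3, 4),
           B = c(101, 102, 103, 104),
           C = c(4, 3, 2, 1),
           D = c(104, 103, 102, 101))

# Clustering rows on raw values groups genes by expression level
raw_groups <- cutree(hclust(dist(x)), k = 2)

# Clustering rows on row-scaled values groups genes by expression pattern
x_scaled <- t(scale(t(x)))
scaled_groups <- cutree(hclust(dist(x_scaled)), k = 2)

raw_groups      # A and C together, B and D together
scaled_groups   # A and B together, C and D together
```

Scaling before clustering is therefore usually the better choice when the goal is to group genes with similar profiles regardless of their absolute expression.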