The exercises for Week 8 are designed to help you become familiar with Singular Value Decomposition (SVD) and understand how this powerful mathematical tool can be used to interpret complex biological experiments that involve time-series data and multiple covariates.
We will continue working with the same dataset used in Week 7. All data pre-processing steps are provided.
The exercises begin in the section titled SVD.
The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), āpatternedā precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.
The gene expression count data was obtained from Kallisto (Bray et al., 2016) using the Gencode hg38 release Homo sapiens transcriptome.
load("./data_09_04_2024.RData")
#Data:
# myE_ge : raw gene expression count matrix
# info : sample annotation (data-frame)
# ttg : rows (genes) annotation
# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge <- myE_ge[,sel_samples]
info <- info[sel_samples,]
info$group <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))
#Make some nice colors to facilitate the visualisation of time-points
mytime <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))We first need to preprocess the data as learned during Week 7. The steps include:
# 1. Log-transform your count data
myE_gel <- log2(1+myE_ge)
# 2. Fit bimodal distribution
bimdens <- lapply(c(1:ncol(myE_gel)),function(IX)return(mclust::densityMclust(data=myE_gel[,IX],G=2,plot=FALSE)))
# 3. Identify limit that discriminate foreground from background
Lims <- unlist(lapply(bimdens,function(x)return(qnorm(0.99,mean=x$parameters$mean[1],sd=sqrt(x$parameters$variance$sigmasq[1])))))
# 4. Select reliably expressed genes in each sample
is_expressed_samples <- do.call(lapply(c(1:ncol(myE_gel)),function(IX)return(myE_gel[,IX]>Lims[IX])),what=cbind)
no_reliably_expressed_genes_samples <- apply(is_expressed_samples,2,sum)
# 5. Select reliably expressed genes in each group
is_expressed_groups <- t(apply(is_expressed_samples,1,function(Z)return(tapply(Z,INDEX=factor(info$group),FUN=function(W)return(sum(W)==length(W))))))
no_reliably_expressed_genes_group <- apply(is_expressed_groups,2,sum)
# 6. Select reliably expressed genes in at least one of the groups
is_expressed_global <- apply(is_expressed_groups,1,sum)>=1#167'69 genes are reliably expressed in at least one group
# 7. Create data-table with reliably expressed genes
myE_gelf <- myE_gel[is_expressed_global,]
# 8. Quantile normalisation
myE_gen <- limma::normalizeQuantiles(myE_gelf)While hierarchical clustering, explored in Week 7, can highlight prominent changes driven by a small subset of genes, it may overlook broader, more subtle patterns. In contrast, Singular Value Decomposition (SVD) is a powerful technique for uncovering independent sources of variation in gene expression data, potentially corresponding to distinct biological pathways.
SVD is particularly effective for analyzing data with a time component, as it can reveal gene expression programs that vary over time, even when those changes are nonlinear or not directly proportional to time. This makes SVD a valuable tool for disentangling complex temporal dynamics in biological systems.
In the context of gene expression analysis, SVD decomposes the data matrix into a set of orthogonal components (also known as singular vectors), each associated with a singular value. The square of each singular value reflects the amount of variance captured by each component. To compute the proportion of variance explained, each squared singular value is divided by the sum of all squared singular values.
A scree plot is a graphical representation of how much variance in the original data is captured by each component obtained from Singular Value Decomposition (SVD). It helps to identify how many components are meaningful. The point at which the curve levels offāthe so-called āelbowāāindicates where additional components contribute relatively little to explaining the overall variance.
The Shannon Entropy of the distribution of variance explained by each SVD component is a useful metric for assessing how information is distributed across components. If the variance is concentrated in a single component, the entropy will be close to zero, indicating that most of the structure in the data is captured by that one component. Conversely, if the variance is evenly spread across many components, the entropy will be higher, reflecting a more complex or diffuse structure
In omics data, itās common for the first component to explain a large proportion of the varianceāoften over 90%. This dominant component usually reflects systematic variability in basal gene expression levels between genes, rather than specific biological signals of interest.
getFractionVariance<- function(mySVD){
return(mySVD$d*mySVD$d/sum(mySVD$d*mySVD$d))
}
getShannonEntropy <- function(pip)return(-1*sum(pip*log10(pip))/log10(length(pip)))
ScrePlot <- function(pi,dp){
barplot(pi,las=1,cex.main=0.7,cex.axis=1.0,col="black")
mtext(side = 1, line = 2, text ="principal components", col = "black",cex=0.7, font=1)
mtext(side = 2, line = 2, text ="Fraction of explained variance", col = "black",cex=0.7, font=1)
mtext(side = 3, line = -2, text = paste("Shannon Entropy = ",round(dp,digits=3)), col = "black",cex=0.7, font=1)
}
#Get fraction variance
pi_1 <- getFractionVariance(mySVD=SVD_eset)
#Calculate the SE
dp_1 <- getShannonEntropy(pi_1)
ScrePlot(pi_1,dp_1)Since the first SVD component often captures the majority of variance in gene expression dataātypically reflecting systematic or basal variability rather than meaningful biological patternsāit can be informative to remove it.
This can be done by subtracting the rank-one approximation (i.e., the image of the first component) from the original data matrix. This operation effectively filters out the dominant source of variation, allowing downstream analyses to focus on secondary components that may represent more biologically relevant signals.
We can now visualize the data after removing the first component to better explore these subtler patterns.
#1. Remove the first principal component:
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
#2. Calculate again the SVD
SVD_2 <- svd(datm)
#3. Calculate SE and fraction of variance captured
pi_2 <- getFractionVariance(mySVD=SVD_2)
dp_2 <- getShannonEntropy(pi_2)
#3. Check the effects
par(mfrow=c(2,4))
#3.1 SE and Scre plot
ScrePlot(pi_1,dp_1)
mtext(side=3,line=0,text="prior removing first component")
#3.2 Effect on the sample distribution
multidensity(dat1[,c(1,10,15,20)],main="prior filtering")
boxplot(dat1,outline=FALSE)
#3.3 Effect on the gene expression distribution
boxplot(t(dat1[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")
ScrePlot(pi_2,dp_2)
mtext(side=3,line=0,text="after removing first component")
multidensity(datm[,c(1,10,15,20)],main="after filtering")
boxplot(datm,outline=FALSE)
boxplot(t(datm[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")A commonly used visualization in PCA and SVD analyses is the scatter plot of samples projected onto the first few components. This type of plot is particularly useful when the dataset contains one or two covariates, as it can reveal patterns, groupings, or trends associated with these variables.
Although many studies focus primarily on the first two components, the scree plot provides guidance on how many components are truly informative and worth exploring.
In the following section, you will generate scatter plots of the samples projected onto the first four SVD components, which may help uncover more nuanced structure in the data.
par(mfrow=c(4,4),mar=c(4,4,1,1))
for(i in c(1:4)){
for(j in c(1:4)){
plot(SVD_2$v[,i], SVD_2$v[,j],col=mycols,xlab=paste("v",i),ylab=paste("v",j),pch=19,cex=1.5,las=1)
if(i==1 & j==1){
legend("topright",pch=19,col=mycols_days,ncol=1,leg=paste("d",names(mycols_days)),cex=0.6)
}
}
}In the previous section, we used classic scatter plots to explore the main SVD components. However, a more insightful approach to investigate the temporal dynamics of gene regulatory programs is to examine how the component loadings change over time.
To do this, we plot the average loading values of samples grouped by time point and condition (e.g., nucleus vs.Ā cytoplasm). This allows us to observe how each component behaves across time and under different experimental conditions, potentially revealing distinct temporal expression patterns.
#Extract the loading of the nuclear samples as well as info; create a new grouping
myV_nuc <- SVD_2$v[which(info$Fraction=="Nuclear"),]
info_nuc<- info[which(info$Fraction=="Nuclear"),]
mygroup_nuc <- factor(as.character(info_nuc$DIV),levels=c(0,3,7,14,22,35))
#Extract the loading of the cytoplasmic samples as well as info
myV_cyt <- SVD_2$v[which(info$Fraction=="Cytoplasmic"),]
info_cyt<- info[which(info$Fraction=="Cytoplasmic"),]
mygroup_cyt <- factor(as.character(info_cyt$DIV),levels=c(0,3,7,14,22,35))
#Compute the average and SD per group per singular vectors
myMean_nuc <- t(apply(myV_nuc,2,function(x)return(tapply(x,INDEX=mygroup_nuc,FUN=mean))))
mySD_nuc <- t(apply(myV_nuc,2,function(x)return(tapply(x,INDEX=mygroup_nuc,FUN=sd))))
myMean_cyt <- t(apply(myV_cyt,2,function(x)return(tapply(x,INDEX=mygroup_cyt,FUN=mean))))
mySD_cyt <- t(apply(myV_cyt,2,function(x)return(tapply(x,INDEX=mygroup_cyt,FUN=sd))))
#Plot component over time
days <- c(0,3,7,14,21,35)
cats <- c("Gene Expression","AltEx","IR","3UTR","PUD")
CEX<- 0.7
par(mfrow=c(3,3),mar=c(5,5,2,0),oma=c(2,2,2,2))
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
if(length(x) != length(y) | length(y) !=length(lower) | length(lower) != length(upper))
stop("vectors must be same length")
arrows(x,y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
for(i in c(1:9)){
MIN=min(0,min(c(myMean_nuc[i,]-mySD_nuc[i,],myMean_cyt[i,]-mySD_cyt[i,])))
MAX=max(c(myMean_nuc[i,]+mySD_nuc[i,],myMean_cyt[i,]+mySD_cyt[i,]))
plot(days,myMean_nuc[i,],pch=19,type="b",lty=2,ylim=c(MIN,MAX),las=1,frame="F",xlab="time [days]",cex=1.0,cex.axis=CEX,cex.lab=CEX,ylab="")
mtext(side=2,line=3,text=paste("PC",i,sep=""),cex=CEX)
mtext(side=3,line=0,text=paste(round(100*getFractionVariance(SVD_2)[i],digit=2),"% of variance"),cex=CEX)
grid()
error.bar(x=days, y=myMean_nuc[i,], upper=mySD_nuc[i,], length=0.05)
abline(h=0,col="red",lty=2)
points(days,myMean_cyt[i,],pch=19,type="b",cex=1.0,cex.axis=CEX,col="grey")
lines(days,myMean_cyt[i,],lty=2,col="grey")
error.bar(x=days, y=myMean_cyt[i,], col="grey",upper=mySD_cyt[i,], length=0.05)
if(i==1){
legend("topright",pch=19,col=c("black","grey"),cex=0.6,leg=c("nuc","cyto"),ncol=2)
}
}After identifying the most interesting componentsāthose that show low within-group variability and a biologically meaningful temporal dynamicāwe can explore which genes are associated with these components. This is done by examining the left singular vectors, which represent the gene loadings.
There are two main strategies for identifying relevant genes:
Contribution-based selection: Extract genes that contribute the most (positively or negatively) to a component. These are the genes with the largest loadings in the left singular vectorsāindicating large projection onto that component.
Correlation-based selection: Identify genes whose expression profiles correlate most (positively or negatively) with the component scores across samples.
Both strategies provide insights into different aspects of how genes relate to the underlying structure captured by SVD.
Once weāve identified the genes that contribute most (positively and negatively) to a given component, itās important to visually inspect their expression patterns to confirm they behave as expected.
This can be done using boxplots or heatmaps, which help assess whether the gene expression patterns align with the trends captured by the components.
Weāll begin with boxplots, which allow us to compare expression levels across different conditions and time points for the top contributing genes.
Another useful visualization is the heatmap, which provides an intuitive overview of expression patterns across all samples for selected genes.
Here, we will generate a heatmap for the genes that contribute most positively and most negatively to the first component. This allows us to assess whether these genes exhibit coherent expression patterns that align with the signal captured by the component.
You are encouraged to extend this analysis to other components as well, to explore additional gene expression programs present in the data.
#In order to faciliate the visualisation of gene dynamics across the samples, it is often the case that data are standardized.
standardize <- function(z){
rowmed <- apply(z, 1, mean)
rowmad <- apply(z, 1, sd)
rv <- sweep(z, 1, rowmed)
rv <- sweep(rv, 1, rowmad, "/")
return(rv)
}
#Function to plot the Heatmap
PlotHeatmapSelSVD <- function(dat1=myE_gen,mySelection=mySelb,myVi=c(1:6),colord=myorder){
for(i in myVi){
vi=i
tx <- mySelection[[vi]][[2]]
mat <- as.matrix(dat1[match(tx,rownames(dat1)),colord])
GS <- as.character(ttg$ext_gene)[match(rownames(mat),ttg$ens_gene)]
rownames(mat) <- GS
mat <- standardize(mat)
dd <- as.matrix(dist(mat,method="man"))
diag(dd) <- 0
dd.row <- as.dendrogram(hclust(as.dist(dd),method="ward.D"))
row.ord <- order.dendrogram(dd.row)
mypalette <- colorRampPalette(colors=c("blue","white","red"), bias = 1, space = c("rgb"), interpolate = c("linear"))
mycols <- mypalette(n=100)
b <- seq(from=-max(abs(mat)),to=max(abs(mat)),length.out=101)
heatmap.2(mat[row.ord,], keysize=1,mar=c(6,5),col=mycols,breaks=b, scale="none",main=paste("V",i,"+"),Rowv=FALSE,Colv=NA,key=TRUE,symkey=FALSE, density.info="none", trace="none",cexCol=0.5, cexRow=0.5, font.lab=1,dendrogram="none")
tx <- mySelection[[vi]][[3]]
mat <- as.matrix(dat1[match(tx,rownames(dat1)),colord])
GS <- as.character(ttg$ext_gene)[match(rownames(mat),ttg$ens_gene)]
rownames(mat) <- GS
mat <- standardize(mat)
dd <- as.matrix(dist(mat,method="man"))
diag(dd) <- 0
dd.row <- as.dendrogram(hclust(as.dist(dd),method="ward.D"))
row.ord <- order.dendrogram(dd.row)
mypalette <- colorRampPalette(colors=c("blue","white","red"), bias = 1, space = c("rgb"), interpolate = c("linear"))
mycols <- mypalette(n=100)
b <- seq(from=-max(abs(mat)),to=max(abs(mat)),length.out=101)
heatmap.2(mat[row.ord,], keysize=1,main=paste("V",i,"(-)"),mar=c(6,5),col=mycols,breaks=b, scale="none",Rowv=FALSE,Colv=NA,key=TRUE,symkey=FALSE, density.info="none", trace="none",cexCol=0.5, cexRow=0.5, font.lab=1,dendrogram="none")
}
}After validating the expression dynamics of the top contributing genes, the next step is to assess the biological significance of the identified components.
This can be achieved by performing gene set enrichment analysis on the genes that contribute most strongly (positively or negatively) to each component. This analysis helps identify biological pathways or functional categories that are overrepresented, providing insight into the gene regulatory programs captured by the singular vectors.
By linking components to known pathways, we can better interpret the biological meaning of the structure uncovered by SVD.
Below are the Gene Ontology Biological Processes (GO BPs) enriched among the genes that contribute most positively to Principal Component 1 (PC1).
Below are the Gene Ontology Biological Processes (GO BPs) enriched among the genes that contribute most negatively to Principal Component 1 (PC1).