The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), āpatternedā precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.
The gene expression count data was obtained from Kallisto (Bray et
al., 2016) using the Gencode hg38 release Homo sapiens
transcriptome.
load("./data_09_04_2024.RData")
#Data:
# myE_ge : raw gene expression count matrix
# info : sample annotation (data-frame)
# ttg : rows (genes) annotation
# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge <- myE_ge[,sel_samples]
info <- info[sel_samples,]
info$group <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))
#Make some nice colors to facilitate the visualisation of time-points
mytime <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))How many samples is there in your count data?
Does this correspond to your experimental protocol?
Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file.
What are the covariates?
how many rows?
Check that the rowID of the data-count corresponds to the ensembl gene ID in your gene annotation file.
#View(info)
#nrow(info)
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
match(colnames(myE_ge),info$sampleID)## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
sum(is.na(match(rownames(myE_ge),ttg$ens_gene)))## [1] 0
How does your data look-like?
par(mfrow=c(2,3))
plot(density(myE_ge[,1]),main="Read count distribution in sample 1")
boxplot(myE_ge,las=1,ylab="raw gene count",main="with outliers")
boxplot(myE_ge,outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
plot(density(t(myE_ge[1,])),main="Read count distribution across samples")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),las=1,ylab="raw gene count",main="with outliers")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),outline=FALSE,las=1,ylab="raw gene count",main="without outliers")Analysis of the variance of the gene count across the \(N\) samples: \(\sigma^2=Var(X)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\) and \(\mu=\frac{1}{n}\sum_{i=1}^n\)x_i:
Letās have a close look at how the variance in gene expression scale with average and then correct it with some transformation.
#calculate mean and variance of the rows
row_avg<-apply(myE_ge,1,mean)
row_var <-apply(myE_ge,1,var)
#Log-transform your count data
myE_gel <- log2(1+myE_ge)
par(mfrow=c(1,2))
#Variance scale with average --> Poisson distribution
plot(row_avg,row_var,las=1,main="RAW data",pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="mean read count",ylab="variance read count",xlim=c(0,5000),ylim=c(0,10^7))
grid()
plot(apply(myE_gel,1,mean),apply(myE_gel,1,var),las=1,main="Log2-transformed data",pch=19,col=rgb(0,0,0,0.1),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()Letās first have a look at the distribution of gene expression for a few samples:
par(mfrow=c(1,3))
geneplotter::multidensity(myE_gel[,c(1,2,4,10)],main="Read count distributions",las=1,xlab="read count [log2]")
plot(density(myE_gel[,1]),las=1,main=colnames(myE_gel)[1],las=1,xlab="read count [log2]")
plot(density(myE_gel[,3]),las=1,main=colnames(myE_gel)[3],las=1,xlab="read count [log2]")We can identify reliably expressed genes by fitting a bimodal distribution to the log2-read count distribtution of each samples to discriminate between the foreground and the background transcription. The limit must be fitted to each individual sample given that the library size will impact this factor.
We can now select for the reliably expressed genes in each sample:
Letās first look at the ditribution of the gene count across a few
samples:
There are several options for normalisation. Scaling factors,
quantile normalisation etc.
What is the effect on the gene count?
Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.
Letās first compare the clustering of the samples using the Manhattan
distance and Ward D algorithm:
Samples are very similarly clustered in Quantile and Deseq2 normalised methods. Letās compare the different effect of the distances used as well as agglomerative alorithm. From now we will use the quantile normalised matrix: