1 Load the data
2 Get familiar with your data
3 Pre-processing of the count data
4 Unsupervised hierarchical clustering analysis of the samples

1 Load the data

The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), ‘patterned’ precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.

$Schematic depicting the iPSC differentiation strategy for motor neurogenesis. Arrows indicate sampling time-points in days when cells were fractionated into nuclear and cytoplasmic compartments prior to deep (polyA) RNA-sequencing. Four iPSC clones were obtained from four different healthy controls and three iPSC clones from two ALS patients with VCP mutations: R155C and R191Q; hereafter termed VCPmu. Induced-pluripotent stem cells (iPSC); neural precursors (NPC); “patterned” precursor motor neurons (ventral spinal cord; pMN); post-mitotic but electrophysiologically inactive motor neurons (MN); electrophysiologically active MNs (mMN).$ The gene expression count data was obtained from Kallisto (Bray et al., 2016) using the Gencode hg38 release Homo sapiens transcriptome.

load("./data_09_04_2024.RData")
#Data: 
# myE_ge                  : raw gene expression count matrix 
# info                    : sample annotation (data-frame)
# ttg                     : rows (genes) annotation

# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge      <- myE_ge[,sel_samples]
info        <- info[sel_samples,]
info$group  <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))


#Make some nice colors to facilitate the visualisation of time-points
mytime                 <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days            <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days)     <- c(0,3,7,14,22,35)
mycols                 <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))

2 Get familiar with your data

How many samples is there in your count data?

Does this correspond to your experimental protocol?

Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file.

What are the covariates?

how many rows?

Check that the rowID of the data-count corresponds to the ensembl gene ID in your gene annotation file.

#View(info)
#nrow(info)
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
match(colnames(myE_ge),info$sampleID)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
sum(is.na(match(rownames(myE_ge),ttg$ens_gene)))

## [1] 0

How does your data look-like?

par(mfrow=c(2,3))
plot(density(myE_ge[,1]),main="Read count distribution in sample 1")
boxplot(myE_ge,las=1,ylab="raw gene count",main="with outliers")
boxplot(myE_ge,outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
plot(density(t(myE_ge[1,])),main="Read count distribution across samples")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),las=1,ylab="raw gene count",main="with outliers")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),outline=FALSE,las=1,ylab="raw gene count",main="without outliers")

3 Pre-processing of the count data

3.1 Variance stabilisation with log-transformation

Analysis of the variance of the gene count across the $N$ samples: $\sigma^2=Var(X)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$ and $\mu=\frac{1}{n}\sum_{i=1}^n$x_i:

Let’s have a close look at how the variance in gene expression scale with average and then correct it with some transformation.

#calculate mean and variance of the rows
row_avg<-apply(myE_ge,1,mean)
row_var <-apply(myE_ge,1,var)

#Log-transform your count data
myE_gel <- log2(1+myE_ge)



par(mfrow=c(1,2))
#Variance scale with average --> Poisson distribution
plot(row_avg,row_var,las=1,main="RAW data",pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="mean read count",ylab="variance read count",xlim=c(0,5000),ylim=c(0,10^7))
grid()

plot(apply(myE_gel,1,mean),apply(myE_gel,1,var),las=1,main="Log2-transformed data",pch=19,col=rgb(0,0,0,0.1),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()

3.2 Identification of reliably expressed genes

Let’s first have a look at the distribution of gene expression for a few samples:

par(mfrow=c(1,3))
geneplotter::multidensity(myE_gel[,c(1,2,4,10)],main="Read count distributions",las=1,xlab="read count [log2]")
plot(density(myE_gel[,1]),las=1,main=colnames(myE_gel)[1],las=1,xlab="read count [log2]")
plot(density(myE_gel[,3]),las=1,main=colnames(myE_gel)[3],las=1,xlab="read count [log2]")

We can identify reliably expressed genes by fitting a bimodal distribution to the log2-read count distribtution of each samples to discriminate between the foreground and the background transcription. The limit must be fitted to each individual sample given that the library size will impact this factor.

We can now select for the reliably expressed genes in each sample:

3.3 Normalisation

Let’s first look at the ditribution of the gene count across a few samples:

There are several options for normalisation. Scaling factors, quantile normalisation etc.

What is the effect on the gene count?

4 Unsupervised hierarchical clustering analysis of the samples

Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.

Let’s first compare the clustering of the samples using the Manhattan distance and Ward D algorithm:

Samples are very similarly clustered in Quantile and Deseq2 normalised methods. Let’s compare the different effect of the distances used as well as agglomerative alorithm. From now we will use the quantile normalised matrix:

BIO-463: Bulk RNA-sequencing Analysis

Raphaëlle Luisier

April 2025