In this tutorial, we will analyze single-cell RNA sequencing data
from Ho et. al,
Genome Research, 2018. They are derived from 451Lu melanoma cell
line in two different conditions: 1) parental (untreated) and 2)
resistant (treated for 6 weeks with BRAF inhibitors and yet
proliferating hence called resistant):
Importantly the differenceS between these two cell populations are 1) the exposure to treatment (none versus chronic); 2) the resistance that emerges from the treatment. Hence these two variables (treatment and resistance) are confounded by the experimental set-up. This tutorial is built on the Seurat package; you can refer to their vignettes for more details about the functions.
The data can be downloaded from ZENODO under https://zenodo.org/records/11032235.
load("./scRNA_data_melanoma.RData")
#Data:
# expression_matrix : 19386 x 8574 sparse Matrix of class "dgTMatrix" in triplet form https://www.rdocumentation.org/packages/Matrix/versions/1.6-5/topics/dgTMatrix-class
# cell_metadata : matrix of dimension n_cells X 2; col1=uniqueID of the cell; col2=variable ()
# gene_metadata : matrix of dimension n_genes X 1; simply the gene symbols of the rows of the expression matrices
mycols <- c("#6699FF", "#CC33CC")Letās check how many genes we have quantified and across how many cells:
## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:29660859] 19 28 52 57 114 120 133 134 135 136 ...
## ..@ j : int [1:29660859] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ Dim : int [1:2] 19386 8574
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:19386] "OR4F5" "OR4F29" "AL645608.2" "SAMD11" ...
## .. ..$ : chr [1:8574] "AAACCTGAGCGTTGCC" "AAACCTGCAAACTGTC" "AAACCTGCAGCGTCCA" "AAACCTGCATATGAGA" ...
## ..@ x : num [1:29660859] 1 2 1 1 7 1 1 5 1 1 ...
## ..@ factors : list()
## [1] "n_genes= 19386"
## [1] "n_cells= 8574"
Letās look at how many cells are parental versus resistant:
##
## parental resistant
## 4452 4122
Letās now create a Seurat object. Please note that we do not filter out genes according to a pre-determined number of reads as we are going to automatically identify the correct number of reliable genes as previously performed in bulk gene expression.
Quality control (QC) is a critical step in single-cell RNA-seq (scRNA-seq) data analysis. Low-quality cells are removed from the analysis during the QC process to avoid misinterpretation of the data. Seurat object has two features:
High nCount_RNA and/or nFeature_RNA indicates that the a cell may in fact be a doublet (or multiplet). We can calculate two additional metrics for QC:
The QC process usually involves applying user-defined thresholds for different metrics computed for each individual cell to filter out doublets and ālow-qualityā cells Luecken et al, 2019. Cell filtering is performed according to several criteria that are going to be used sequentially.
Cells that are poor quality are likely to have low genes and UMIs per cell, as well as high mitochondrial counts ratio. Good cells will generally exhibit both higher number of genes per cell and higher numbers of UMIs. In single-cell RNA sequencing experiments, doublets are generated from two cells. They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols involving thousands of cells.
layout(matrix(ncol=2,nrow=1,c(1:2),byrow = TRUE))
# Add number of genes per UMI for each cell to metadata
GE$log10GenesPerUMI <- log10(GE$nFeature_RNA) / log10(GE$nCount_RNA)
#another notation: GE[["log10GenesPerUMI"]] <- log10(GE$nFeature_RNA) / log10(GE$nCount_RNA)
# Compute percent mito ratio
GE$mitoRatio <- PercentageFeatureSet(object = GE, pattern = "^MT-")/100
#another notation: GE[["mitoRatio"]] <- PercentageFeatureSet(object = GE, pattern = "^MT-")/100
#Create subsets to facilitate the analysis
GE_parental <- subset(GE, subset = treatment == "parental")
GE_resistant <- subset(GE, subset = treatment == "resistant")
#Compare #genes versus UMI
smoothScatter(GE_parental$nFeature_RNA,GE_parental$nCount_RNA,las=1,main="Parental",xlab="# genes",ylab="# UMI")
smoothScatter(GE_resistant$nFeature_RNA,GE_resistant$nCount_RNA,las=1,main="Resistant",xlab="# genes",ylab="# UMI")The UMI counts per cell should generally be above 500, that is the low end of what we expect. If UMI counts are between 500-1000 counts, it is usable but the cells probably should have been sequenced more deeply.
layout(matrix(ncol=4,nrow=1,c(1,1,1,2),byrow = TRUE))
myUMI<- list(GE_parental$nCount_RNA,GE_resistant$nCount_RNA)
multidensity(myUMI,las=1,xlab="# UMI per cell",ylab="density",col=mycols,main="",leg=FALSE)
abline(v=8000)
myvals=unlist(lapply(myUMI,function(Z)return(sum(Z>8000)/length(Z))))
mp<- barplot(myvals,las=1,col=mycols,ylab="% selected cells",ylim=c(0,1))
mtext(side=3,line=0,text=round(myvals,digit=2),cex=0.6,at=mp)For high quality data, the proportional histogram should contain a single large peak that represents cells that were encapsulated. If we see a small shoulder to the right of the major peak or a bimodal distribution of the cells, that can indicate a couple of things:
Therefore, this threshold should be assessed with other metrics included in this tutorial. Source: hbctraining.github.io. In order to automatically identify the selection threshold, we can fit a bimodal distribution and select 99% of the foreground.
# 1 Fit bimodal distribution
bimdens <- mclust::densityMclust(data=GE$nFeature_RNA,G=2,plot=FALSE)
# 2. Identify limit that discriminate foreground from background
lim <- qnorm(0.99,mean=bimdens$parameters$mean[1],sd=sqrt(bimdens$parameters$variance$sigmasq[1]))
# Visualise those values
layout(matrix(ncol=5,nrow=1,c(1,1,1,2,3),byrow = TRUE))
myGenes<- list(GE_parental$nFeature_RNA,GE_resistant$nFeature_RNA)
multidensity(myGenes,las=1,xlab="# genes per cell",ylab="density",col=mycols,main="",leg=FALSE)
abline(v=lim)
myvals=unlist(lapply(myGenes,function(Z)return(sum(Z>2000)/length(Z))))
mp<- barplot(myvals,las=1,col=mycols,ylab="% selected (F_genes) cells",ylim=c(0,1))
mtext(side=3,line=0,text=round(myvals,digit=2),cex=0.6,at=mp)
sel_Genes<- GE$nFeature_RNA>2000
myvals=c(sum(sel_Genes&sel_UMI&GE$treatment=="parental")/sum(GE$treatment=="parental"),
sum(sel_Genes&sel_UMI&GE$treatment=="resistant")/sum(GE$treatment=="resistant"))
mp<-barplot(myvals,las=1,col=mycols,ylab="% selected (F_genes&UMI) cells",ylim=c(0,1))
mtext(side=3,line=0,text=round(myvals,digit=2),cex=0.6,at=mp)This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. Early publications in the field established a threshold of 5% and since then, it has been used as a default in several software packages for scRNA-seq data analysis, and adopted as a standard in many scRNA-seq studies. However, the validity of using a uniform threshold across different species, single-cell technologies, tissues and cell types has not been adequately assessed Osorio et al.Ā (2021). Other defined poor quality samples for mitochondrial counts as cells which surpass the 0.2 mitochondrial ratio mark.Source: hbctraining.github.io
par(mfrow=c(1,1),mar=c(10,20,10,5))
# Create a data frame with the treatment and percent.mt values
data_mt <- data.frame(Treatment = GE@meta.data$treatment, PercentMT = GE@meta.data$mitoRatio)
# Create the boxplot
ggplot(data_mt, aes(x = Treatment, y = PercentMT, fill = Treatment)) +
geom_boxplot() + scale_fill_manual(values = mycols) +
theme_minimal() +
labs(title = "Boxplot of percent.mt per treatment", x = "Treatment", y = "Percent.mt")We can now test what would be the optimal threshold values for each individual samples by using the IQR values (identification of outliers).
# Compute the upper quartile and the interquartile range of the % of Mitocondrial read counts
Q3_par <- quantile(data_mt[which(data_mt$Treatment=='parental'),]$PercentMT, 0.75)
IQR_par <- IQR(data_mt[which(data_mt$Treatment=='parental'),]$PercentMT)
Q3_res <- quantile(data_mt[which(data_mt$Treatment=='resistant'),]$PercentMT, 0.75)
IQR_res <- IQR(data_mt[which(data_mt$Treatment=='resistant'),]$PercentMT)
# Define the upper limit for outliers
upper_limit_par <- Q3_par + 1.5 * IQR_par
upper_limit_res <- Q3_res + 1.5 * IQR_reslayout(matrix(ncol=5,nrow=1,c(1,1,1,2,3),byrow = TRUE))
myGenes<- list(GE_parental$mitoRatio,GE_resistant$mitoRatio)
multidensity(myGenes,las=1,xlab="ratio mitochondrial genes",ylab="density",col=mycols,main="",leg=TRUE,xlim=c(0,0.1))
abline(v=c(upper_limit_par,upper_limit_res),col=mycols,lty=2)
myLims=c(upper_limit_par,upper_limit_res)
myvals=unlist(lapply(c(1,2),function(IX)return(sum(myGenes[[IX]]<myLims[IX])/length(myGenes[[IX]]))))
mp<- barplot(myvals,las=1,col=mycols,ylab="% selected (F_mito) cells",ylim=c(0,1))
mtext(side=3,line=0,text=round(myvals,digit=2),cex=0.6,at=mp)
sel_Mito<- GE$mitoRatio<myLims[1]&GE$treatment=="parental"|GE$mitoRatio<myLims[2]&GE$treatment=="resistant"
myvals<- c(sum(sel_Genes&sel_UMI&sel_Mito&GE$treatment=="parental")/sum(GE$treatment=="parental"),
sum(sel_Genes&sel_UMI&sel_Mito&GE$treatment=="resistant")/sum(GE$treatment=="resistant"))
mp<-barplot(myvals,las=1,col=mycols,ylab="% selected (F_genes&UMI&Mito) cells",ylim=c(0,1))
mtext(side=3,line=0,text=round(myvals,digit=2),cex=0.6,at=mp)The number of genes detected per UMI gives an idea of the complexity of the data-set: the more genes detected per UMI, the more complex the data. Sometimes we can detect contamination with low complexity cell types like red blood cells via this metric. Generally, we expect the novelty score to be above 0.80.
layout(matrix(ncol=5,nrow=1,c(1,1,1,2,3),byrow = TRUE))
myGenes<- list(GE_parental$log10GenesPerUMI,GE_resistant$log10GenesPerUMI)
names(myGenes) <- c("parental","resistant")
multidensity(myGenes,las=1,xlab="log10GenesperUMI",ylab="density",col=mycols,main="",xlim=c(0.6,1.0))
abline(v=0.8)
myvals=unlist(lapply(myGenes,function(Z)return(sum(Z>0.08)/length(Z))))
mp<- barplot(myvals,las=1,col=mycols,ylab="% selected (F_complexity) cells",ylim=c(0,1))
mtext(side=3,line=0,text=round(myvals,digit=2),cex=0.6,at=mp)
sel_complex<- GE$log10GenesPerUMI>0.8
myvals <- c(sum(sel_Genes&sel_UMI&sel_Mito&sel_complex&GE$treatment=="parental")/sum(GE$treatment=="parental"),
sum(sel_Genes&sel_UMI&sel_Mito&sel_complex&GE$treatment=="resistant")/sum(GE$treatment=="resistant"))
mp <-barplot(myvals,las=1,col=mycols,ylab="% selected (F_genes&UMI&Mito&Complex) cells",ylim=c(0,1))
mtext(side=3,line=0,text=round(myvals,digit=2),cex=0.6,at=mp)It is often the case that UMIs and genes detected are evaluated together. Below is a scatter plot showing the number of genes versus the numnber of UMIs per cell coloured by the fraction of mitochondrial reads. Mitochondrial read fractions are only high (light blue color) in particularly low count cells with few detected genes.
layout(matrix(ncol=2,nrow=1,c(1:2),byrow = TRUE))
GE_parental <- subset(GE, subset = treatment == "parental")
GE_resistant <- subset(GE, subset = treatment == "resistant")
plot(GE_parental$nFeature_RNA,GE_parental$nCount_RNA,las=1,main="Parental",xlab="# genes",ylab="# UMI",pch=19,cex=0.1,col=rgb(0,0,0,0.1),frame=FALSE)
points(GE_parental$nFeature_RNA[GE_parental$mitoRatio>0.2],GE_parental$nCount_RNA[GE_parental$mitoRatio>0.2],col="red",pch=19,cex=0.3)
mtext(side=3,line=0,text=paste("# not passed per mito =",sum(GE_parental$mitoRatio>0.2)),cex=0.8)
plot(GE_resistant$nFeature_RNA,GE_resistant$nCount_RNA,las=1,main="Resistant",xlab="# genes",ylab="# UMI",pch=19,cex=0.1,col=rgb(0,0,0,0.1),frame=FALSE)
points(GE_resistant$nFeature_RNA[GE_resistant$mitoRatio>0.2],GE_resistant$nCount_RNA[GE_resistant$mitoRatio>0.2],col="red",pch=19,cex=0.3)
mtext(side=3,line=0,text=paste("# not passed per mito =",sum(GE_resistant$mitoRatio>0.2)),cex=0.8)layout(matrix(ncol=2,nrow=1,c(1:2),byrow = TRUE))
smoothScatter(GE_parental$log10GenesPerUMI,GE_parental$mitoRatio,las=1,main="Parental",xlab="genes per UMI",ylab="mitochondiral ratio",ylim=c(0,0.1))
smoothScatter(GE_resistant$log10GenesPerUMI,GE_resistant$mitoRatio,las=1,main="Resistant",xlab="genes per UMI",ylab="mitochondiral ratio",ylim=c(0,0.1))Considering any of the QC metrics in isolation can lead to misinterpretation of cellular signals. For example, cells with a comparatively high fraction of mitochondrial counts may be involved in respiratory processes and may be cells that you would like to keep. Likewise, other metrics can have other biological interpretations. Thus, always consider the joint effects of these metrics when setting thresholds and set them to be as permissive as possible to avoid filtering out viable cell populations unintentionally.
In summary here we are going to use the following thresholds that need to be carefully considered for each experiment/data-set:
layout(matrix(ncol=1,nrow=1,1,byrow = TRUE))
VlnPlot(GE, features = c("nFeature_RNA", "nCount_RNA", "log10GenesPerUMI","mitoRatio"), ncol = 4)The following plot shows the distribution of the metrics prior filtering:
plot1 <- FeatureScatter(GE, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = "treatment", cols = mycols)
plot2 <- FeatureScatter(GE, feature1 = "nCount_RNA", feature2 = "mitoRatio", group.by = "treatment", cols = mycols)
plot1 + plot2## [1] "Before any filtering, we have 8574 cells."
## [1] "4452 parental cells"
## [1] "4122 resistant cells"
#Cell counts
GE <- subset(GE, subset = nFeature_RNA >= lim)
print(paste("After removing cells with low gene counts, we end up with", as.character(dim(GE)[2]), "cells."))## [1] "After removing cells with low gene counts, we end up with 6645 cells."
# UMIs
GE <- subset(GE, subset = nCount_RNA>500)
print(paste("After removing cells with low UMI, we end up with", as.character(dim(GE)[2]), "cells."))## [1] "After removing cells with low UMI, we end up with 6645 cells."
#Mitochondrial filter
GE <- subset(GE, subset = (treatment == "parental" & mitoRatio > upper_limit_par), invert=T)
print(paste("After removing cells with high MT in parental, we end up with", as.character(dim(GE)[2]), "cells."))## [1] "After removing cells with high MT in parental, we end up with 6622 cells."
GE <- subset(GE, subset = (treatment == "resistant" & mitoRatio > upper_limit_res), invert=T)
print(paste("After removing cells with high MT in resistant, we end up with", as.character(dim(GE)[2]), "cells."))## [1] "After removing cells with high MT in resistant, we end up with 6567 cells."
#Perplexity
GE <- subset(GE, subset = log10GenesPerUMI > 0.8)
print(paste("After removing cells with low MT complexity, we end up with", as.character(dim(GE)[2]), "cells."))## [1] "After removing cells with low MT complexity, we end up with 6414 cells."
## [1] "We end up with 6414 cells:"
## [1] "3321 parental cells"
## [1] "3093 resistant cells."
The following plot shows the distribution of the metrics after filtering:
plot1 <- FeatureScatter(GE, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = "treatment", cols = mycols)
plot2 <- FeatureScatter(GE, feature1 = "nCount_RNA", feature2 = "mitoRatio", group.by = "treatment", cols = mycols)
plot1 + plot2The following figure shows the total number of genes with more than one read count per cell. As we can see, the majority of cells do not express more than 3000 genes out of the 20,000 studied.
layout(matrix(ncol=3,nrow=1,c(1,1,2),byrow = TRUE))
count_matrix <- GE@assays$RNA@layers$counts
# Convert the count matrix to binary (1 if a gene is expressed in a cell, 0 otherwise)
binary_matrix <- as.matrix(count_matrix > 0)
# Compute the number of cells expressing each gene
cells_per_gene <- Matrix::rowSums(binary_matrix)
# Print the result
hist(cells_per_gene, main="Distribution of the number of cells per genes", ylab="Number of genes", xlab="Number of cells", breaks=30)
boxplot(cells_per_gene,outline=FALSE,las=1)Single-cell data have a very low sensitivity and therefore the data are very sparse with many genes with zero counts. It is common to filter out these genes, similarly to what we previously did with bulk RNA-sequencing data. Obviously genes with zero count across all cells are filtered out. Additionally many studies select genes with read count above a pre-defined threshold for example selecting genes which are expressed in 10 or more cells. However we propose an alternative based on our previous work on bulk. As performed with bulk RNA-sequencing data, where non-reliably expressed genes were filtered out in order to remove non-relevant genes that can add noise to the data, we will apply the same logic here.
We first create pseudo-bulk RNA-seq data of dimension \(g\times2\) where \(g\) is the number of genes across two groups (parental and resistant).
layout(matrix(ncol=3,nrow=1,c(1,2,2),byrow = TRUE))
psbulk_parental <- log2(rowSums(subset(GE, subset = treatment == "parental")@assays$RNA@layers$counts) + 1)
psbulk_resistant <- log2(rowSums(subset(GE, subset = treatment == "resistant")@assays$RNA@layers$counts) + 1)
boxplot(list(parental=psbulk_parental,resistant=psbulk_resistant),outline=FALSE,col=mycols,las=1,ylab="log2 read count")
multidensity(list(parental=psbulk_parental,resistant=psbulk_resistant),col=mycols,las=1,xlab="log2 read count")Letās now fit a bimodal distribution to each sample group to select a threshold of selection:
par(mfrow=c(1,2),mar=c(10,5,10,2))
# 1 Fit bimodal distribution
bimden <- mclust::densityMclust(data=psbulk_parental,G=2,plot=FALSE)
lim_parental <- qnorm(0.9,mean=bimden$parameters$mean[1],sd=sqrt(bimden$parameters$variance$sigmasq[1]))
bimden <- mclust::densityMclust(data=psbulk_resistant,G=2,plot=FALSE)
lim_resistant <- qnorm(0.9,mean=bimden$parameters$mean[1],sd=sqrt(bimden$parameters$variance$sigmasq[1]))
plot1 <- hist(psbulk_parental, breaks=50, col=mycols[1], main="Pseudo-bulk expression -- parental", xlab="log2(count+1)", ylab="Number of genes")
abline(v=lim_parental, col='red')
plot2 <- hist(psbulk_resistant, breaks=50, col=mycols[2], main="Pseudo-bulk expression -- resistant", xlab="log2(count+1)", ylab="Number of genes")
abline(v=lim_resistant, col='red')# Filter according to the found thresholds
indexes_parental <- which(psbulk_parental > lim_parental)
indexes_resistant <- which(psbulk_resistant > lim_resistant)
# We keep a gene if reliably expressed in one of the two conditions
GE <- GE[union(indexes_parental, indexes_resistant),]
psbulk_parental <- log2(rowSums(subset(GE, subset = treatment == "parental")@assays$RNA@layers$counts) + 1)
psbulk_resistant <- log2(rowSums(subset(GE, subset = treatment == "resistant")@assays$RNA@layers$counts) + 1)
print(paste("After the filtering based on pseudo-bulk expression, we have", as.character(dim(GE)[1]), "genes."))## [1] "After the filtering based on pseudo-bulk expression, we have 12517 genes."
The goal of normalization is to account for observed differences in measurements between samples and/or features (e.g., genes) resulting from technical artifacts or unwanted biological effects (e.g., batch effects) rather than biological effects of interest.
Normalization of scRNA-seq data is often accomplished via methods developed for bulk RNA-seq or even microarray data. These methods tend to neglect prominent features of scRNA-seq data such as Cole et al.Ā Cell Systems, 2019:
Seurat proposes different normalisation methods that will be tested here and compared using the distributions of average gene expression per cell and per gene:
LogNormalize: feature counts for each cell are divided
by the total counts for that cell and multiplied by the scale.factor.
This is then natural-log transformed using log1p.CLR: Applies a centered log ratio transformationRC: Relative counts. Feature counts for each cell are
divided by the total counts for that cell and multiplied by the
scale.factor. No log-transformation is applied. For counts per million
(CPM) set scale.factor = 1e6SCTransform= modeling framework for the normalization
and variance stabilization of molecular count data from scRNA-seq
experiments. This procedure omits the need for heuristic steps including
pseudocount addition or log-transformation and improves common
downstream analytical tasks such as variable gene selection, dimensional
reduction, and differential expression. Please look at the full vignette
here.We also propose to test with the Qantile Normalisation which has been previously proposed in protocols lacking UMI counts (Townes et al.Ā Genome Biology (2020))[https://link.springer.com/article/10.1186/s13059-020-02078-0].
These steps can take few minutes and therefore we stored the objects
in the .RData file. However if you want to run it on new data, change
eval=TRUE.
## [1] 12526 6414
#Seurat objects:
#GE : raw data
#GE_log : log-normalise
#GE_clr : CLR normalisation
#GE_rc : RC normalisation
#GE_sct : SCTtransformed
#GE_qn : quantile normalised# Take a few minutes to run; therefore stored in the RData object
# To make it active simply change eval to TRUE
GE_log <- NormalizeData(GE, normalization.method = "LogNormalize", scale.factor = 1e6)
GE_clr <- NormalizeData(GE, normalization.method = "CLR")
GE_rc <- NormalizeData(GE, normalization.method = "RC")
GE_sct <- SCTransform(GE)
#There are two matrices in the Seurat object
# 1) GE_sct@assays$SCT/RNA@data; this is the normalised counts; and rownames(as.matrix(GE_sct@assays$SCT@data)) get the GeneSymbol
# 2) GE_sct@assays$RNA@layers$counts however this one is not containing the rownames; this is the raw counts
#Extract count matrix and normalise
q_norm <- 2^limma::normalizeQuantiles(log2(1+GE@assays$RNA@layers$counts))
q_norm <- Matrix(round(q_norm), sparse = TRUE) #create sparse matrix
rownames(q_norm) <- rownames(GE)
colnames(q_norm) <- colnames(GE)
GE_qn <- GE
GE_qn@assays$RNA@layers$data <- q_norm
#Save all objects
save(list=c("GE","GE_log","GE_clr","GE_rc","GE_sct","GE_qn"),file="./normalised_seurat_objects.RData")We can now assess the effect of individual normalisation methods on the distribution of gene expression in each sample/cell first. As shown below, the different methods have distinct effect.
load("./normalised_seurat_objects.RData")
layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
myidx <- sample(c(1:ncol(count_matrix)),size=30)
multidensity(log2(1+as.matrix(GE@assays$RNA@layers$counts)[,myidx]),outline=FALSE,leg=FALSE,xlab="read count [log2]",main="Not normalised",las=1)
multidensity(as.matrix(GE_log@assays$RNA@layers$data[,myidx]),outline=FALSE,leg=FALSE,main="LogNormalize",xlab="read count [log2]",las=1)
multidensity(as.matrix(GE_clr@assays$RNA@layers$data)[,myidx],outline=FALSE,leg=FALSE,main="CLR",xlab="read count [log2]",las=1)
multidensity(log2(1+as.matrix(GE_rc@assays$RNA@layers$data))[,myidx],outline=FALSE,leg=FALSE,main="RC",xlab="read count [log2]",las=1)
multidensity(as.matrix(GE_sct@assays$SCT@data)[,myidx],outline=FALSE,leg=FALSE,main="SCTransform",xlab="read count [log2]",las=1)
multidensity(log2(1+as.matrix(GE_qn@assays$RNA@layers$data)[,myidx]),outline=FALSE,leg=FALSE,main="Quantile",xlab="read count [log2]",las=1)layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
boxplot(log2(1+as.matrix(GE@assays$RNA@layers$counts)[,myidx]),outline=FALSE,leg=FALSE,ylab="read count [log2]",main="Not normalised",las=1)
boxplot(as.matrix(GE_log@assays$RNA@layers$data[,myidx]),outline=FALSE,leg=FALSE,main="LogNormalize",ylab="read count [log2]",las=1)
boxplot(as.matrix(GE_clr@assays$RNA@layers$data)[,myidx],outline=FALSE,leg=FALSE,main="CLR",ylab="read count [log2]",las=1)
boxplot(log2(1+as.matrix(GE_rc@assays$RNA@layers$data))[,myidx],outline=FALSE,leg=FALSE,main="RC",ylab="read count [log2]",las=1)
boxplot(as.matrix(GE_sct@assays$SCT@data)[,myidx],outline=FALSE,leg=FALSE,main="SCTransform",ylab="read count [log2]",las=1)
boxplot(log2(1+as.matrix(GE_qn@assays$RNA@layers$data)[,myidx]),outline=FALSE,leg=FALSE,main="Quantile",ylab="read count [log2]",las=1)We will first investigate how the variance in gene expression across the cells varies with the average gene expression. As shown below the different methods reach very different results yet none of this exhibit the expected variance stabilisation expected.
layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
# Calculate average gene expression per cell
# Do not consider the zeros
#Raw count in log2
avg_GE_raw <- apply(log2(1+as.matrix(GE@assays$RNA@layers$counts)),1,function(Z)return(mean(Z[Z>0])))
var_GE_raw <- apply(log2(1+as.matrix(GE@assays$RNA@layers$counts)),1,function(Z)return(var(Z[Z>0])))
names(var_GE_raw) <- rownames(GE)
smoothScatter(avg_GE_raw,var_GE_raw,las=1,xlab="average read count across cells [log2]",ylab="variance read count across cells [log2]",main="Raw count")
grid()
#LogNormalize
avg_GE_log <- apply(as.matrix(GE_log@assays$RNA@layers$data),1,function(Z)return(mean(Z[Z>0])))
var_GE_log <- apply(as.matrix(GE_log@assays$RNA@layers$data),1,function(Z)return(var(Z[Z>0])))
names(var_GE_log) <- rownames(GE_log)
smoothScatter(avg_GE_log,var_GE_log,las=1,xlab="average read count across cells [log2]",ylab="variance read count across cells [log2]",main="Log Normalise")
grid()
#CLR
avg_GE_clr <- apply(as.matrix(GE_clr@assays$RNA@layers$data),1,function(Z)return(mean(Z[Z>0])))
var_GE_clr <- apply(as.matrix(GE_clr@assays$RNA@layers$data),1,function(Z)return(var(Z[Z>0])))
names(var_GE_clr) <- rownames(GE_clr)
smoothScatter(avg_GE_clr,var_GE_clr,las=1,xlab="average read count across cells [log2]",ylab="variance read count across cells [log2]",main="CLR")
grid()
#RC
avg_GE_rc <- apply(log2(1+as.matrix(GE_rc@assays$RNA@layers$data)),1,function(Z)return(mean(Z[Z>0])))
var_GE_rc <- apply(log2(1+as.matrix(GE_rc@assays$RNA@layers$data)),1,function(Z)return(var(Z[Z>0])))
names(var_GE_rc) <- rownames(GE_rc@assays$RNA@layers$data)
smoothScatter(avg_GE_rc,var_GE_rc,las=1,xlab="average read count across cells [log2]",ylab="variance read count across cells [log2]",main="RC")
grid()
#SCT
avg_GE_sct <- apply(as.matrix(GE_sct@assays$SCT@data),1,function(Z)return(mean(Z[Z>0])))
var_GE_sct <- apply(as.matrix(GE_sct@assays$SCT@data),1,function(Z)return(var(Z[Z>0])))
names(var_GE_sct) <- rownames(GE_sct)
smoothScatter(avg_GE_sct,var_GE_sct,las=1,xlab="average read count across cells [log2]",ylab="variance read count across cells [log2]",main="SCT")
grid()
# Quantile Normalised
avg_GE_ql <- apply(log2(1+as.matrix(GE_qn@assays$RNA@layers$data)),1,function(Z)return(mean(Z[Z>0])))
var_GE_ql <- apply(log2(1+as.matrix(GE_qn@assays$RNA@layers$data)),1,function(Z)return(var(Z[Z>0])))
names(var_GE_ql) <- rownames(GE_qn)
smoothScatter(avg_GE_ql,var_GE_ql,las=1,xlab="average read count across cells [log2]",ylab="variance read count across cells [log2]",main="Quantile")
grid() We can as well calculate the average cell expression profile in order to compare between treatment and control cells.
#Raw count in log2
avg_CE_raw <- apply(log2(1+as.matrix(GE@assays$RNA@layers$counts)),2,function(Z)return(mean(Z[Z>0])))
#LogNormalize
avg_CE_log <- apply(as.matrix(GE_log@assays$RNA@layers$data),2,function(Z)return(mean(Z[Z>0])))
#An alternative way to do this is the following:
#avg_CE_logp <- tapply(GE_log@assays$RNA@layers$data@x,col(GE_log@assays$RNA@layers$data)[(!GE_log@assays$RNA@layers$data==0)@x],mean)
#CLR
avg_CE_clr <- apply(as.matrix(GE_clr@assays$RNA@layers$data),2,function(Z)return(mean(Z[Z>0])))
#RC
avg_CE_rc <- apply(log2(1+as.matrix(GE_rc@assays$RNA@layers$data)),2,function(Z)return(mean(Z[Z>0])))
#SCT
avg_CE_sct <- apply(as.matrix(GE_sct@assays$SCT@data),2,function(Z)return(mean(Z[Z>0],na.rm=TRUE)))
# Quantile Normalised
avg_CE_ql <- apply(log2(1+as.matrix(GE_qn@assays$RNA@layers$data)),2,function(Z)return(mean(Z[Z>0])))
# Create a list of average gene expressions
avg_GEs <- list("Raw"=avg_CE_raw,"LogNormalize" = avg_CE_log, "CLR" = avg_CE_clr, "RC" = avg_CE_rc, "SCTransform" = avg_CE_sct,"Quantile"=avg_CE_ql)
layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
for(i in 1:length(avg_GEs)) {
multidensity(list(parental=avg_GEs[[i]][which(GE$treatment == "parental")],
resistant=avg_GEs[[i]][which(GE$treatment == "resistant")]),
col=mycols,las=1,main=names(avg_GEs[i]),, xlab="Average Gene Expression")
}#adapted from: http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
library(reshape2)
cormat <- round(cor(do.call(what=cbind,args=avg_GEs),method="pearson"),2)
melted_cormat <- melt(cormat)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
#Reorder the correlation matrix
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# Print the heatmap
ggheatmapHere we choose to keep 10ā000 genes, because we donāt want to loose to much biological information.
layout(matrix(ncol=2,nrow=1,c(1:2),byrow = TRUE))
#Identify top 10'000 genes
GE_log <- FindVariableFeatures(GE_log, selection.method = "vst", nfeatures = 10000)
#GE_clr <- FindVariableFeatures(GE_clr, selection.method = "vst", nfeatures = 10000)
#GE_rc <- FindVariableFeatures(GE_rc, selection.method = "vst", nfeatures = 10000)
#GE_sct <- FindVariableFeatures(GE_sct, selection.method = "vst", nfeatures = 10000)
#GE_qn <- FindVariableFeatures(GE_qn, selection.method = "vst", nfeatures = 10000)
# Identify the 10 most highly variable genes
top10_log <- head(VariableFeatures(GE_log), 10)
#top10_clr <- head(VariableFeatures(GE_clr), 10)
#top10_rc <- head(VariableFeatures(GE_rc), 10)
#top10_sct <- head(VariableFeatures(GE_sct), 10)
# Plot variable features with and without labels
plot1 <- VariableFeaturePlot(GE_log)
plot2 <- LabelPoints(plot = plot1, points = top10_log, repel = TRUE)
plot1 + plot2Prior to unsupervised clustering, dimensionality reduction is often applied on the most informative genes/features. As performed in week 9, we can use different methods for this task.
Selection of top 1000 variable genes.
Prior performing the PCA we first need to scale and center the genes.
layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
# Scale and Center gene features
GE_log <- ScaleData(GE_log, features =rownames(GE_log))
#Run PCA
GE_log <- RunPCA(GE_log, features = VariableFeatures(object = GE_log))
#Run t-SNE
GE_log <- RunTSNE(GE_log, dims = 1:10,perplexity=10)
#Run UMAP
GE_log <- RunUMAP(GE_log, dims = 1:10,n.neighbors=5,min.dist=0.4,metric="man")PCA plot:
pt<-DimPlot(GE_log, reduction = "pca", group.by = "treatment", pt.size=1, cols = mycols,dims = c(1, 2),alpha=0.2) +ggtitle("PCA")
ptt-SNE plot:
pt<-DimPlot(GE_log, reduction = "tsne", group.by = "treatment", pt.size=1, cols = mycols,dims = c(1, 2),alpha=0.2) +ggtitle("t-SNE")
ptpt<-DimPlot(GE_log, reduction = "umap", group.by = "treatment", pt.size=1, cols = mycols,alpha=0.2)+ggtitle("UMAP")
ptLetās have a look at the PCA loading for the first 6 PCs in two different methods:
layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
# Assuming that 'seurat_object' is your Seurat object and 'treatment' is your metadata column for treatments
pca_data <- data.frame(GE_log@reductions$pca@cell.embeddings[,1:6],
Treatment = GE_log@meta.data$treatment) # Treatment labels
for (i in 1:6){
boxplot(list(parental=pca_data[pca_data["Treatment"] == "parental", ][,i],resistant=pca_data[pca_data["Treatment"] == "resistant", ][,i]),col=mycols,las=1,ylab=paste("PC", as.character(i)),main=paste("PC", as.character(i)), outline=FALSE)
}The following heatmaps show the PCA loading for the most positively and negatively contributing genes in individual cells. The idea is that as long as weāre seeing clear structure in one of these plots then weāre still adding potentially useful information to the analysis. Both cells and genes are sorted by their principal component scores. It allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the āextremeā cells on both ends of the spectrum.