1. Introduction to R

Author

EPFL - SV - BIO-463

Published

February 18, 2025

The R Programming Language

R is a programming language used for statistical analysis and data manipulation, widely used by several scientific communities which have contributed a large number of libraries. R is free (GNU license) and can be used to produce publication-ready graphics.

R binaries and additional packages can be downloaded from several sites:

Installation instructions on the CRAN site.
List of CRAN packages
The Bioconductor project (bioinformatics software)
The rdocumentation site

It can be used on the command-line in a terminal:

bash$ R
R version 4.4.2 (2024-10-31) -- "Pile of Leaves"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin20

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 1+1
 [1] 2
> y = c(1, 4, 9, 16, 25)
> sqrt(y)
 [1] 1 2 3 4 5
> myFunc = function(x) {1+x^2}
> myFunc(1:10)
 [1]   2   5  10  17  26  37  50  65  82 101
> ?log
> q()

However we recommend using the RStudio Desktop, an IDE for R available at posit.co.

Installing packages

R comes with many additional packages that provide statistical methods for various types of data analysis. To install them you use either

install.packages (if it comes from CRAN)
BiocManager::install (if it comes from Bioconductor):

install.packages(c("BiocManager", "quarto"))
BiocManager::install("pheatmap")

Once a package is installed, you need to load it into your session with the command library:

BiocManager
library(BiocManager)

Exercise 1

The purpose of this exercise is to observe the effect of some common operations in R, and familiarize yourself with the language and the interface.

Try to change some of the commands and see the effect.

Open RStudio.
Create a “New project” (from the File menu), chose “Version Control” and “Git”, paste the URL https://gitlab.epfl.ch/genomics-and-bioinformatics/course-data-2025.git and chose the location on your computer to save it.
Alternatively you can clone the same gitlab repository into your working directory and open the directory from RStudio.
Open the file ExercisesWeek1.qmd in RStudio (this is the file used to generate the document you are currently reading…)
Run the following code blocks and understand what they are doing.

Read the data from the tab-delimited file GeneExpressionData.txt (open the file as well to have a look at its content):

data = read.delim("GeneExpressionData.txt", row.names=1)

If the file is not found, check your path and use setwd() to change to your working directory:

getwd()
##  setwd("/YOUR/PATH/TO/GITLAB/REPO")
dir()

First look at the data (notice that rows and columns have names!):

dim(data)
head(data)
data[1:4, ]
data$id

data$C1[1]
data$C2[3:10]
data["ATP2A3",]
vector = data$C1
vector[4]

Compute some basic statistics:

summary(data)
summary(data$C1)
mean(data$C2)
median(log(data$C2))
sapply(data[1:10,], min)
sapply(data, max)
apply(data, 1, mean)
apply(data, 2, sd)

?sd

Elementary data transformation (are all ratios well-defined?):

any(data$C2==0) 
which(data$C2==0)
ratios = log2(data$C1/data$C2)
geomMeans = sqrt(data$C1*data$C2)

Plot the data

plot(data$C1, data$C2, log='xy', pch=20, main='', xlab='C1', ylab='C2')
h1 = hist(log2(data$C1), breaks=30, main='', xlab='log2 values')
hist(log2(data$C2), br=h1$breaks, add=T, col=2)

If you would like to learn more about R, we suggest two online courses that are focused on Bioinformatics:

Exercise 2

In this exercise we will perform a typical gene expression analysis based on a dataset from Leukemia cells:

Load the dataset leukemiaExpressionSubset.rds (it is in compressed RDS format):

library(pheatmap)
data = readRDS("leukemiaExpressionSubset.rds")

In the file, samples (table columns) are named according to cell type and experiment number. Let us create an annotation table by splitting the sample type and the sample number in different columns:

colnames(data)
annotations = data.frame(
            LeukemiaType = substr(colnames(data),1,3),
            row.names = substr(colnames(data),10,13))
colnames(data) = rownames(annotations)

Log-transform the data, generate scatter plots of sample pairs and a boxplot of the distribution of gene expression values:

logdata = log2(data)
## calculate the median per column (dimension no 2)
meddata = apply(logdata, 2, median)
## next subtract this median from each column
logdata = sweep(logdata, 2, meddata, "-")
## choose a color for each cell type
typeCols = c("ALL"='red', "AML"='magenta', "CLL"='blue', "CML"='cyan', "NoL"='gray')
## set some graphical parameters and plot the data
par(las=1, cex=1.1, lwd=2, lty=1, pch=20)
pairs(logdata[,1:5])
boxplot(logdata, las=2, lty=1, lwd=2, col=typeCols[annotations$LeukemiaType], pch=20)

Create a clustered “heatmap” of the data:

pheatmap(logdata, show_rownames=F, annotation_col=annotations, scale='none', 
         clustering_distance_cols='correlation', clustering_method='complete',
         annotation_colors=list(LeukemiaType=typeCols))

Save the transformed data to a tab-delimited text file:

write.table(logdata, file = "testoutput.txt", sep="\t", quote=F)

Exercise 3

In this exercise we will seek information from genomic data portals mentionned in the lecture, and use them to inform our analysis.

We are interested in the gene BCL2A1 because it has been implicated in many cancers, including Leukemia. We would like to see if it displays some interesting signal in our data.

Go to the Ensembl site and search for the identifier of this gene (should replace the string ENSG00000XXXXXX in the code). Use this identifier to extract the corresponding row from the log-data matrix, and show that it is disregulated in acute leukemia (ALL, AML):

geneid = "ENSG00000XXXXXX"
bcl2a1_expression = as.numeric(logdata[geneid,])
boxplot(bcl2a1_expression~annotations$LeukemiaType)

Use the UCSC genome browser to answer the following questions:

Which strand is human BCL2A1 on?
How many splice variants (isoforms) exist according to the NCBI RefSeq and to GENCODE?
What is the next protein-coding gene upstream of BCL2A1? and downstream?
Can you find a binding site for NFKB1 (nuclear factor kappa B subunit 1, a transcription factor) less than 10kb upstream of BCL2A1, within a Dnase-1 hypersensitive site bearing an H3K27ac mark? Hint: look at the tracks ReMap ChIP-seq and Encode Regulation