1. Introduction to R - solutions

Author

EPFL - SV - BIO-463

Published

February 18, 2025

Exercise 3

We are interested in the gene BCL2A1 because it has been implicated in many cancers, including leukemia. We would like to check whether it shows an interesting signal in our data.

The Ensembl identifier of this gene is ENSG00000140379. Use this identifier to extract the corresponding row from the log-data matrix, and show that it is dysregulated in acute leukemia (ALL, AML):

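### reload the data saved during the exercise
logdata = read.table("testoutput.txt")
data = readRDS("leukemiaExpressionSubset.rds")
### leukemia type = characters 1 to 3 of each column name; row names = characters 10 to 13
annotations = data.frame(LeukemiaType = substr(colnames(data),1,3),
                         row.names = substr(colnames(data),10,13))
### extract the BCL2A1 row by its Ensembl ID and plot its expression per leukemia type
geneid = "ENSG00000140379"
bcl2a1_expression = as.numeric(logdata[geneid,])
boxplot(bcl2a1_expression~annotations$LeukemiaType,
        xlab = "Leukemia type", ylab = "BCL2A1 log expression")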
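
As a numeric complement to the boxplot, the per-type medians can be tabulated. The sketch below is not part of the original solution: it runs on a small simulated matrix (`toy_logdata`, random values, hypothetical column names and placeholder gene IDs), and it assumes five leukemia type labels, of which only ALL and AML are named in the exercise text. It only illustrates the row extraction and the grouping step.

```r
# Simulated stand-in for the log-expression matrix: values are random, NOT real data.
set.seed(1)
# Assumed type labels; only ALL and AML appear in the exercise text.
sample_types = rep(c("ALL", "AML", "CLL", "CML", "NoL"), each = 4)
toy_logdata = matrix(rnorm(3 * length(sample_types)), nrow = 3,
                     dimnames = list(c("ENSG00000140379", "TOY_GENE_A", "TOY_GENE_B"),
                                     paste0(sample_types, "_", seq_along(sample_types))))

geneid = "ENSG00000140379"
# Check that the identifier is present before indexing by row name.
stopifnot(geneid %in% rownames(toy_logdata))

# Extract the gene's row and derive the type from the first 3 characters of each column name.
expr = as.numeric(toy_logdata[geneid, ])
group = factor(substr(colnames(toy_logdata), 1, 3))

# Median expression per leukemia type.
group_medians = tapply(expr, group, median)
print(group_medians)
```

With the real `logdata` and `annotations$LeukemiaType` in place of the simulated objects, the same `tapply` call would give the medians that the boxplot draws as horizontal bars.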

Using the UCSC genome browser we find:

  1. Human BCL2A1 is on the reverse strand
  2. There are 2 isoforms according to NCBI RefSeq and 3 according to GENCODE
  3. The next protein-coding gene upstream of BCL2A1 (in the direction of transcription) is ZFAND6, and downstream is MTHFS.

  1. There is a binding site for NFKB1 (nuclear factor kappa B subunit 1, a transcription factor) less than 10 kb upstream of BCL2A1, within a DNase I hypersensitive site bearing an H3K27ac mark: