Introduction

As many bench scientists know, experiments don’t always go as expected. So it is important to implement controls and sanity checks along the way to make sure that your results truly represent what you are trying to measure. The same goes for sequencing experiments.

In this section, I will introduce methods for quality checks after both sequencing and alignment.

Our goals for this section include:

  • Quality checks to ensure that sequencing was performed as expected.
    • Did we get the expected read lengths?
    • Was there any contamination?
  • Quality checks for a proper alignment
    • What percentage of our reads mapped to our genome?
    • Did our reads map uniquely to the genome?

Check Sequencing QC

Anatomy of a Fastq file

Typically, bioinformaticians receive the sequencing results from Next Generation Sequencing experiments as FASTQ files. These are text files that contain the sequencing data from the clusters that pass a quality filter on a flow cell. Below is an example of a single read in FASTQ format:
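For illustration, here is an invented read (the machine name, coordinates, index, and sequence are all made up) showing the four-line FASTQ layout:

```
@M00001:25:000000000-A1B2C:1:1101:15589:1331 1:N:0:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
```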

The first header line is the read name. Most sequencing facilities (including the GECF) include information about the machine name, the run number, the flowcell and the location of the cluster on the flowcell. You can also find information about paired-end reads and the index sequence.

The second line is the read sequence itself and should contain only A, C, G, T, and N.

The third line is a placeholder that always begins with "+" (the read name can optionally be repeated after it).

And the final line encodes the quality of each base in the read in Phred+33 (ASCII base 33) format.

You can read more about the ASCII Base format and how to interpret it here: https://www.drive5.com/usearch/manual/quality_score.html
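As a quick sanity check of the encoding, here is a small shell sketch (the character 'I' is chosen arbitrarily) that converts a quality character to its Phred+33 score:

```shell
## Convert a single FASTQ quality character to its Phred+33 score.
## 'I' has ASCII code 73, so its Phred score is 73 - 33 = 40.
qchar='I'
ascii=$(printf '%d' "'$qchar")
echo $(( ascii - 33 ))
```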

Running FastQC

FastQC is the most common tool for quality control checks on raw sequencing data coming from high throughput sequencing pipelines. It outputs easy-to-read files so that you can quickly see whether or not you need to implement additional steps before further analysis.

We use FastQC to check raw sequencing files before alignment, so we run the code on the fastq files. To run FastQC, we can simply point to our directory of interest and loop through our files:

## Make fastqc output directory
mkdir ./data/fastqc/

## Run Fastqc on a single file:
fastqc SRR0000000.fastq.gz -o ./data/fastqc/

## Run Fastqc on multiple files
for fl in ./data/fastq/*fastq.gz
do
  fastqc "$fl" -o ./data/fastqc/
done

This can take some time to run so we won’t execute this code for now. However, let’s discuss some of the output results.

Interpreting the FastQC output

FastQC outputs a zip file with all result images and text as well as an html report that summarizes these results.

At the head of the html file, we have “Basic Statistics”. This section reports which file we are looking at, whether the file contains base calls or colorspace data (less common), and which type of sequencer was used (to help interpret the quality value encoding). We can also see the total number of reads (38 million), how many sequences were flagged as poor quality, and the sequence length (which should approximately match what you requested from the facility; in this case, it appears that the facility has already performed a trimming step).

We also like to look at per base sequence quality. This plot shows the average base quality across the length of the read. In untrimmed files there is sometimes a decrease in quality at the beginnings or ends of reads. However, since these reads have already been trimmed by the facility, quality is high across the entire read. If parts of the reads fall into the yellow or red regions, you may need to trim the reads (discussed later).

The overall data quality is confirmed by the second figure, which shows that most of the reads are high quality (mean Phred score around 40):

In some cases, FastQC gives warnings or failures that we don’t have to worry about. For example, our “Per Base Sequence Content” check failed. In a random library, you would expect little to no difference between the different bases across a sequencing run, so the lines in this plot should run parallel to each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced. However, RNA-seq libraries are produced by fragmenting the RNA and priming with random hexamers, which can lead to a nucleotide bias, particularly at the beginnings of reads. This is a technical bias and does not seem to adversely affect downstream analysis.

We could also see in the Basic Statistics that there is some variability in the sequence length (the GECF trimming only keeps reads longer than 25 nt). In the “Sequence Length Distribution” section we can see that almost all of the reads are longer than 90 nt. Note that this section will still give a warning because not all sequences are the same length.

In a diverse library (such as that expected in RNA-seq analyses), most sequences should occur only once. A high level of duplication can indicate an enrichment bias (often seen in low-input libraries, where PCR over-amplification can occur) or simply very deep coverage of the library. This figure shows what percentage of sequences occur multiple times. The blue line indicates the duplication level in the raw file, whereas the red line shows how the data would likely look after a deduplication step. In this case, our files fail because more than 50% of the total reads are duplicated. Typically in bulk RNA-seq, we don’t worry about duplicated reads unless we are working with reads tagged with Unique Molecular Identifiers (UMIs).

Finally we see whether or not we need to perform any trimming due to known adapter sequences being present at high levels in our data. This information will appear both in the “Adapter Content” and “Overrepresented sequences” tabs. Based on the output for this file, there is no adapter contamination in these files. This is likely because the GECF has already performed a trimming step to remove any remaining Illumina adapters.

Further examples of good and bad data can be found in these example reports.

MultiQC to summarize quality metrics

If you are working with a few samples, it isn’t too difficult to skim each fastqc report. However, if you are working with more than 10 files, this can become tedious. The tool MultiQC was built to further condense those results for better interpretation. MultiQC can be run on any directory that contains log files. In this case, we feed it the output directory of fastqc:

multiqc ./data/fastqc/

MultiQC outputs a summary html file that includes the same metrics discussed above. However, here it includes the results from all fastq files (instead of one at a time). In addition, MultiQC outputs a folder that includes much of the same information in text format. This is useful if you want to load these files and create your own plots.

Adapter trimming

As we’ve seen in the fastqc output, occasionally some of our reads still have adapter contamination. These, as well as other sequences such as poly-A runs, primer sequences (which can be seen in the “Overrepresented Sequences” tab), and low-quality regions, can hinder alignment, so it is important to remove them from the reads before alignment. This can be done with the tool cutadapt.

To simply run cutadapt, all you need is the adapter sequence (-a), the output fastq path and file name (-o) and the input fastq file:

## single end gz
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz

Using this command, cutadapt searches for this adapter sequence in all reads and removes it. All reads (even reads trimmed down to length 0) are reported in the output fastq file. If you want to filter short reads out of the fastq output, you can define a minimum length with the option -m LENGTH. Also note that cutadapt accepts compressed input and writes compressed output.
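To see what a minimum-length filter does conceptually, here is a plain awk sketch (this is not cutadapt itself, and the two reads are invented) that keeps only FASTQ records whose sequence is at least 20 nt:

```shell
## Keep only FASTQ records whose sequence is at least 20 nt long (awk sketch, not cutadapt)
printf '@read1\nACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n@read2\nACGT\n+\nIIII\n' |
  awk 'NR % 4 == 1 { h = $0 }   # header line
       NR % 4 == 2 { s = $0 }   # sequence line
       NR % 4 == 3 { p = $0 }   # "+" separator line
       NR % 4 == 0 { if (length(s) >= 20) print h "\n" s "\n" p "\n" $0 }'
```

Only @read1 (24 nt) survives; @read2 (4 nt) is dropped, mirroring what `-m 20` would do.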

In addition to adapter sequences, reads can be trimmed based on quality score (set with -q 10), which removes low-quality bases at the 3’ ends of reads, and poly-A tails can be removed (--poly-a).

Cutadapt also supports trimming of paired-end reads. To do so, we must define both the forward (-a) and reverse (-A) adapter sequences, the forward (-o) and reverse (-p) output files, and both input files:

## Paired end. Option summary (comments cannot follow a trailing backslash):
##   -a / -A            forward / reverse adapter sequences
##   -o / -p            forward / reverse output fastq files
##   --pair-filter=any  remove the pair if either read fails a filter
##   -m 22              remove reads shorter than 22 basepairs
##   --poly-a           remove poly-A tails
##   -q 10              remove low-quality bases from 3-prime ends of reads
cutadapt -a ADAPTER_FWD -A ADAPTER_REV \
  -o out.1.fastq.gz -p out.2.fastq.gz \
  --pair-filter=any \
  -m 22 \
  --poly-a \
  -q 10 \
  reads.1.fastq.gz reads.2.fastq.gz

Using cutadapt in paired-end mode has the added benefit of checking that the two files are properly paired and of filtering out both mates of a pair when either read fails a filter (as configured above with --pair-filter=any).

Check Alignment QC

After alignment, we typically want to check a few of the quality metrics to determine whether or not the alignment was successful.

SAM and BAM files

Alignment results are often reported in SAM format. These are output text files that store information about where reads map on the reference genome (as well as information about unmapped reads). However, these files are often converted into binary format (BAM files) in order to reduce file size and increase efficiency. Below is an example of the sam/bam format:
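As an illustration, here is an invented alignment record (the read name, coordinates, and tag value are hypothetical); in a real file the eleven mandatory fields and the optional tags are tab-separated:

```
SRR0000000.1  99  chr1  3000150  255  40M  =  3000300  190  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:1
```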

Each mapping instance is reported as a single line (wrapped above for easier viewing).

First, we have the read name (as given in the fastq file), followed by the bitwise flag (here the flag indicates that this read was properly mapped as part of a pair). Next, we have the name of the chromosome the read aligned to (in this case chromosome 1). Then we have the position on the chromosome, the mapping quality, the “CIGAR string” (encoding matches, gaps, and insertions), information about the mate’s reference name (= means the same chromosome as this read) and position, and finally the template length.

We can look up flags here: https://broadinstitute.github.io/picard/explain-flags.html
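As a quick illustration (flag 99 = 1 + 2 + 32 + 64, chosen here as an example of a properly paired forward read), the bits can be decoded directly in the shell:

```shell
## Decode SAM flag 99 bit by bit
flag=99
for entry in "1:paired" "2:proper_pair" "4:unmapped" "16:read_reverse" \
             "32:mate_reverse" "64:first_in_pair" "128:second_in_pair"; do
  bit=${entry%%:*}    # numeric bit value
  name=${entry#*:}    # human-readable name
  if [ $(( flag & bit )) -ne 0 ]; then echo "$name"; fi
done
```

This prints paired, proper_pair, mate_reverse, and first_in_pair, matching what the Picard flag explainer reports for 99.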

Then we have the sequence of the read and the quality of each base (as in the fastq file). And finally we have the optional tags. These can give us additional information about things like how many times a read maps to the genome, or, in the case of single-cell data, the UMI and cell barcode.

Visualizing BAM mapping with IGV

We can use a genome viewer like IGV (Integrative Genomics Viewer) or the UCSC Genome Browser to visualize how each read has aligned to the genome in the bam files.

In this IGV snapshot, I’ve visualized how reads accumulate at the Fos gene locus. The top histogram (grey) represents the accumulated reads. Below that, each arrow-block represents one read pair. Reads are colored by directionality, and lines that split blocks represent reads that span splice junctions (also seen in the gene schematic at the bottom). In this instance, we can see that most reads correspond to exonic regions (gene boxes), as would be expected for mRNAs.

Alignment Statistics

STAR outputs a Log.final.out file that gives us information about how well the alignment went. We can load all of our alignment log files into an R environment and compare mapping across all samples.
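Before loading anything into R, a single log can be spot-checked from the shell. The file below is a one-line mock in STAR's `key | value` layout (real logs contain many more fields; the filename and value are invented):

```shell
## Create a one-line mock of a STAR Log.final.out file (tab-separated after the "|")
printf 'Uniquely mapped reads %% |\t93.00%%\n' > demo_Log.final.out
## Pull out the value column for the uniquely mapped percentage
grep 'Uniquely mapped reads %' demo_Log.final.out | cut -f2
```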

First we setup our environment:

## Set seed so that all downstream analysis is the same for all of us
set.seed(775)

## Load cran packages
if(!require("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

Then, we list the Log files in our bam folder:

## Load data
fls <- list.files("./data/bam_files", recursive = TRUE, ## List Log.final files
                    full.names = TRUE, pattern = "_Log.final.out")
names(fls) <- gsub("_Log.final.out","",basename(fls))  ## name files

fls
##                                                   5999_hpp_VS_Rep1 
## "./data/bam_files/5999_hpp_VS_Rep1/5999_hpp_VS_Rep1_Log.final.out" 
##                                                   6000_hpp_HS_Rep1 
## "./data/bam_files/6000_hpp_HS_Rep1/6000_hpp_HS_Rep1_Log.final.out" 
##                                                   6001_hpp_VS_Rep2 
## "./data/bam_files/6001_hpp_VS_Rep2/6001_hpp_VS_Rep2_Log.final.out" 
##                                                   6002_hpp_HS_Rep2 
## "./data/bam_files/6002_hpp_HS_Rep2/6002_hpp_HS_Rep2_Log.final.out" 
##                                                   6014_hpp_HS_Rep3 
## "./data/bam_files/6014_hpp_HS_Rep3/6014_hpp_HS_Rep3_Log.final.out" 
##                                                   6015_hpp_VS_Rep3 
## "./data/bam_files/6015_hpp_VS_Rep3/6015_hpp_VS_Rep3_Log.final.out" 
##                                                   6016_hpp_HS_Rep4 
## "./data/bam_files/6016_hpp_HS_Rep4/6016_hpp_HS_Rep4_Log.final.out" 
##                                                   6017_hpp_VS_Rep4 
## "./data/bam_files/6017_hpp_VS_Rep4/6017_hpp_VS_Rep4_Log.final.out"

We have one file per library. Next, we create a custom function to read each of these files and reorganize it so that it is easier to manipulate for downstream plotting:

## Read and reformat stats information
getStats <- function(fls) {
    ## Read file 
    stats <- read.delim(fls,
                        header = FALSE,
                        col.names = c("variable","value"))
    ## Organize statistics of interest into a dataframe
    data <- data.frame(total_reads = stats[5,2],
                       avg_read_length = stats[6,2],
                       uniq_mapped = stats[8,2],
                       uniq_mapped_perc = stats[9,2],
                       avg_mapped_length = stats[10,2],
                       multi_mapped_loci = stats[23,2],
                       multi_mapped_loci_perc = stats[24,2],
                       multi_mapped_too_many_loci = stats[25,2],
                       multi_mapped_too_many_loci_perc = stats[26,2],
                       unmapped_mismatch_perc= stats[29,2],
                       unmapped_short = stats[30,2]
                       )
    data ## Return the assembled dataframe
}

stats <- lapply(fls, getStats) ## Run function on each file
stats <- do.call(rbind,stats)  ## create matrix of stats for all files
stats ## Vis matrix
##                  total_reads avg_read_length uniq_mapped uniq_mapped_perc
## 5999_hpp_VS_Rep1    38804161             182    36086590           93.00%
## 6000_hpp_HS_Rep1    40275240             182    37383373           92.82%
## 6001_hpp_VS_Rep2    33688333             182    31177854           92.55%
## 6002_hpp_HS_Rep2    34729325             182    32182606           92.67%
## 6014_hpp_HS_Rep3    37121145             182    34185730           92.09%
## 6015_hpp_VS_Rep3    39271560             182    36461961           92.85%
## 6016_hpp_HS_Rep4    38216703             182    35563982           93.06%
## 6017_hpp_VS_Rep4    42674023             182    39437635           92.42%
##                  avg_mapped_length multi_mapped_loci multi_mapped_loci_perc
## 5999_hpp_VS_Rep1            182.12           2114644                  5.45%
## 6000_hpp_HS_Rep1            182.21           2288155                  5.68%
## 6001_hpp_VS_Rep2            182.15           1945813                  5.78%
## 6002_hpp_HS_Rep2            182.21           1910836                  5.50%
## 6014_hpp_HS_Rep3            181.55           2288504                  6.16%
## 6015_hpp_VS_Rep3            182.19           2279085                  5.80%
## 6016_hpp_HS_Rep4            181.87           2120573                  5.55%
## 6017_hpp_VS_Rep4            181.99           2530946                  5.93%
##                  multi_mapped_too_many_loci multi_mapped_too_many_loci_perc
## 5999_hpp_VS_Rep1                      62130                           0.16%
## 6000_hpp_HS_Rep1                      70497                           0.18%
## 6001_hpp_VS_Rep2                      56509                           0.17%
## 6002_hpp_HS_Rep2                      58847                           0.17%
## 6014_hpp_HS_Rep3                      63367                           0.17%
## 6015_hpp_VS_Rep3                      61953                           0.16%
## 6016_hpp_HS_Rep4                      69648                           0.18%
## 6017_hpp_VS_Rep4                      67488                           0.16%
##                  unmapped_mismatch_perc unmapped_short
## 5999_hpp_VS_Rep1                  1.25%          0.14%
## 6000_hpp_HS_Rep1                  1.18%          0.15%
## 6001_hpp_VS_Rep2                  1.40%          0.11%
## 6002_hpp_HS_Rep2                  1.53%          0.14%
## 6014_hpp_HS_Rep3                  1.41%          0.17%
## 6015_hpp_VS_Rep3                  1.03%          0.16%
## 6016_hpp_HS_Rep4                  1.07%          0.14%
## 6017_hpp_VS_Rep4                  1.37%          0.12%

This gives us a data frame with information about the total number of fragments (read pairs) that were mapped to our genome. We also have information about how many of those reads mapped exactly once (uniq), multiple times (multi), or not at all (unmapped). It’s fairly simple to evaluate a few samples in table format, but it’s often much easier to see trends when we plot these data.

First, I like to plot the raw values to see whether there are any outlying libraries. To do so, we reformat the data table so that it’s more easily accessible for the plotting commands (ggplot):

## Reformat read counts for plotting
raw <- stats |>
    select(!grep("perc|avg|unmapped", colnames(stats))) |> ## Select relevant data
    rownames_to_column( var = "sample") |> ## include sample names in table
    gather(key = "stat", value = "value", -sample) |> ## change data from wide to long
    mutate(value = as.numeric(value), ## create numeric values
           stat = factor(stat, levels = unique(stat))) ## Set statistic information as factor (easier plotting)

head(raw)
##             sample        stat    value
## 1 5999_hpp_VS_Rep1 total_reads 38804161
## 2 6000_hpp_HS_Rep1 total_reads 40275240
## 3 6001_hpp_VS_Rep2 total_reads 33688333
## 4 6002_hpp_HS_Rep2 total_reads 34729325
## 5 6014_hpp_HS_Rep3 total_reads 37121145
## 6 6015_hpp_VS_Rep3 total_reads 39271560
## Plot read counts
ggplot(raw, aes(x = sample, y = value, fill = sample)) + ## Setup plotting parameters
    geom_bar(stat = "identity", position = "dodge") + ## Indicate bar plot
    facet_wrap(~stat, scales = "free_y",nrow = 1) + ## Wrap plots based on the different values
    ## Aesthetic preference  
    theme_bw() +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Are there big differences between any of your samples?

Next, I plot mapping percentages. Typically, we expect 70-90% unique mapping for bulk RNA-seq experiments. If one library has much less mapping this could indicate a contamination or another technical problem.

## Plot percent of counts 
perc <- stats |>
      select(grep("perc", colnames(stats))) |>
      rownames_to_column( var = "sample") |>
      gather(key = "stat", value = "value", -sample) |>
      mutate(perc = as.numeric(gsub("%", "", value)),
             stat = factor(stat, levels = unique(stat))) |>
      filter(perc > 1)

ggplot(perc, aes(x = sample, y = perc, fill = stat)) +
    geom_bar(stat = "identity", position = "stack") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          axis.title.x = element_blank()) +
    geom_text(aes(label=perc), position = "stack", vjust=1.5, size=3.5) +
    labs(y = "% total reads")

These data appear to have mapped well to the mouse genome, so we can continue to the exciting part: Differential Expression Analysis.