Getting started

We’re going to use the R packages Tidyverse and ComplexHeatmap to look at the results of our metagenomic workflow and make some figures. The goal is to understand the data we’ve produced whilst thinking about some ways to present it.

Tidyverse is a collection of R packages that are very useful for data analysis. If you’re not already familiar with it, I highly recommend it but there are many different ways to work with data in R and you can choose which ever way works for you.

Load R packages

We’ll start by loading the R packages.

library(tidyverse)
library(ComplexHeatmap)

Load our data

Then we’ll open some of our data:

Taxonomic classifications from GTDB

In 05_Annotation/gtdb you should have a directory for each of your samples. Each sample directory will contain:

  • SAMPLE.bac120.summary.tsv : bacteria taxonomic classifications
  • SAMPLE.ar53.summary.tsv : archaea taxonomic classifications
Functional annotation from METABOLIC

Our functional annotation from METABOLIC is in 05_Annotation/metabolic/SAMPLE/. METABOLIC provides all of the data as a single Excel file, but if you want to load the data in R you can find each sheet of the excel file in METABOLIC_result_each_spreadsheet/. We’ll use the files:

  • METABOLIC_result_worksheet1.tsv : gene annotations
  • METABOLIC_result_worksheet2.tsv : function annotations
Representative MAGs from dRep

The dRep output is in 06_Representative_MAGs/drep/out/data_tables/ We’ll use the files:

  • Wdb.csv : a list of your representative genomes, the cluster of genomes they represent, and their score.
  • Cdb.csv : your complete list of genomes and the cluster the were assigned to.
  • genomeInfo.csv : completeness, contamination, length and N50 of each MAG
Relative abundance data from CoverM

The CoverM output is in 06_Representative_MAGs/coverm/

  • coverm_drep_mag_rel_abun.tsv : relative abundance data for representative MAGs

Taxonomic classifications from GTDB

# Read in the GTDB output files (read.csv) for bacteria and archaea for each sample. Header = (TRUE or FALSE), separator = tab.
# You'll need to change the sample names to match your dataset.

gtdb_bac_KR46_May <- 
  read.csv("../../../05_Annotation/gtdb/KR46_May/KR46_May.bac120.summary.tsv", 
           header = TRUE, sep = "\t")

gtdb_arc_KR46_May <- 
  read.csv("../../../05_Annotation/gtdb/KR46_May/KR46_May.ar53.summary.tsv", 
           header = TRUE, sep = "\t")

gtdb_bac_KR46_June <- 
  read.csv("../../../05_Annotation/gtdb/KR46_June/KR46_June.bac120.summary.tsv", 
           header = TRUE, sep = "\t")

gtdb_arc_KR46_June <- 
  read.csv("../../../05_Annotation/gtdb/KR46_June/KR46_June.ar53.summary.tsv", 
           header = TRUE, sep = "\t")

gtdb_bac_KR46_Sept <- 
  read.csv("../../../05_Annotation/gtdb/KR46_Sept/KR46_Sept.bac120.summary.tsv", 
           header = TRUE, sep = "\t")

gtdb_arc_KR46_Sept <- 
  read.csv("../../../05_Annotation/gtdb/KR46_Sept/KR46_Sept.ar53.summary.tsv", 
           header = TRUE, sep = "\t")

# Join all of our taxonomic information into one table called "gtdb_tab". (rbind = row binding).

gtdb_tab <- 
  rbind(gtdb_arc_KR46_June, 
        gtdb_arc_KR46_May, 
        gtdb_arc_KR46_Sept, 
        gtdb_bac_KR46_June, 
        gtdb_bac_KR46_May, 
        gtdb_bac_KR46_Sept)

# Select the columns "genome" and "classification" in "gtdb_tab".

gtdb_tab %>%
  select(user_genome, classification) %>%
  as_tibble %>%
  head()
## # A tibble: 6 × 2
##   user_genome              classification                                       
##   <chr>                    <chr>                                                
## 1 KR46_June_concoct_bin.11 d__Archaea;p__Methanobacteriota;c__Methanobacteria;o…
## 2 KR46_May_concoct_bin.32  d__Archaea;p__Methanobacteriota;c__Methanobacteria;o…
## 3 KR46_Sept_concoct_bin.6  d__Archaea;p__Methanobacteriota;c__Methanobacteria;o…
## 4 KR46_June_concoct_bin.19 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Peptostr…
## 5 KR46_June_concoct_bin.2  d__Bacteria;p__Bacillota_B;c__Desulfotomaculia;o__Am…
## 6 KR46_June_concoct_bin.26 d__Bacteria;p__Chloroflexota;c__Anaerolineae;o__Anae…

You’ll see that the taxonomy is one long string of text, so we’ll create a table where the taxonomy is separated into columns by rank.

# Create taxonomy table and rename the column "user_genome" to "genome".

gtdb_tax <-
  gtdb_tab %>%
  select(user_genome, classification) %>%
  rename(genome = user_genome) %>%
  separate(classification, 
           sep = ";", 
           into = c("domain", "phylum", "class", "order", "family", "genus", "species"), 
           remove = F) %>%
  as_tibble()

head(gtdb_tax)
## # A tibble: 6 × 9
##   genome           classification domain phylum class order family genus species
##   <chr>            <chr>          <chr>  <chr>  <chr> <chr> <chr>  <chr> <chr>  
## 1 KR46_June_conco… d__Archaea;p_… d__Ar… p__Me… c__M… o__M… f__Me… g__U… s__UBA…
## 2 KR46_May_concoc… d__Archaea;p_… d__Ar… p__Me… c__M… o__M… f__Me… g__U… s__UBA…
## 3 KR46_Sept_conco… d__Archaea;p_… d__Ar… p__Me… c__M… o__M… f__Me… g__U… s__UBA…
## 4 KR46_June_conco… d__Bacteria;p… d__Ba… p__Ba… c__C… o__P… f__An… g__P… s__    
## 5 KR46_June_conco… d__Bacteria;p… d__Ba… p__Ba… c__D… o__A… f__De… g__S… s__SUR…
## 6 KR46_June_conco… d__Bacteria;p… d__Ba… p__Ch… c__A… o__A… f__An… g__U… s__

Now we have separate columns for our GTDB taxonomic information in gtdb_tax.

Functional annotation from METABOLIC

# Read in the METABOLIC gene annotations. 
# Don't forget to change the sample names to match your dataset.

metabolism_KR46_May <- 
  read.csv("../../../05_Annotation/metabolic/KR46_May/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet1.tsv", 
           header = T, sep = "\t")

metabolism_KR46_June <- 
  read.csv("../../../05_Annotation/metabolic/KR46_June/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet1.tsv", 
           header = T, sep = "\t")

metabolism_KR46_Sept <- 
  read.csv("../../../05_Annotation/metabolic/KR46_Sept/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet1.tsv", 
           header = T, sep = "\t")

# Read in the METABOLIC function annotations.

function_KR46_May <- 
  read.csv("../../../05_Annotation/metabolic/KR46_May/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet2.tsv", 
           header = T, sep = "\t")

function_KR46_June <- 
  read.csv("../../../05_Annotation/metabolic/KR46_June/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet2.tsv", 
           header = T, sep = "\t")

function_KR46_Sept <- 
  read.csv("../../../05_Annotation/metabolic/KR46_Sept/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet2.tsv", 
           header = T, sep = "\t")

Representative MAGs from dRep

# Read in the dRep output files (read.csv). Notice this time the separator is a comma.

Wdb <- 
  read.csv("../../../06_Representative_MAGs/drep/out/data_tables/Wdb.csv", 
           header = TRUE, sep = ",")

Cdb <- 
  read.csv("../../../06_Representative_MAGs/drep/out/data_tables/Cdb.csv", 
           header = TRUE, sep = ",")

genomeInfo <- 
  read.csv("../../../06_Representative_MAGs/drep/out/data_tables/genomeInfo.csv", 
           header = TRUE, sep = ",")

# In the column "genome" below, we still have the ".fa" file extension.
head(Wdb)
##                        genome cluster     score
## 1 KR46_June_concoct_bin.11.fa     1_1  93.67456
## 2 KR46_Sept_concoct_bin.14.fa     2_1  88.95760
## 3   KR46_May_metabat.bin.5.fa     3_1 102.52197
## 4  KR46_June_concoct_bin.7.fa     4_0  63.56420
## 5 KR46_June_metabat.bin.10.fa     5_1 100.64033
## 6   KR46_May_metabat.bin.4.fa     6_1  88.32306
# Let's remove ".fa" to make it easier search and join with other data tables later.

# Representive MAGs
Rep_bins <- 
  Wdb %>%
  mutate(genome = str_remove(genome, ".fa"))

head(Rep_bins)
##                     genome cluster     score
## 1 KR46_June_concoct_bin.11     1_1  93.67456
## 2 KR46_Sept_concoct_bin.14     2_1  88.95760
## 3   KR46_May_metabat.bin.5     3_1 102.52197
## 4  KR46_June_concoct_bin.7     4_0  63.56420
## 5 KR46_June_metabat.bin.10     5_1 100.64033
## 6   KR46_May_metabat.bin.4     6_1  88.32306
# Cluster of all MAGs
Cdb_tab <- 
  Cdb %>%
  mutate(genome = str_remove(genome,".fa")) %>%
  select(genome, secondary_cluster)

head(Cdb_tab)
##                     genome secondary_cluster
## 1 KR46_June_concoct_bin.11               1_1
## 2  KR46_May_concoct_bin.32               1_1
## 3  KR46_Sept_concoct_bin.6               1_1
## 4  KR46_June_concoct_bin.2               2_1
## 5   KR46_May_concoct_bin.1               2_1
## 6 KR46_Sept_concoct_bin.14               2_1
# MAG quality info
BinQ_tab <-
  genomeInfo %>%
  mutate(genome = str_remove(genome, ".fa"))

head(BinQ_tab)
##                     genome completeness contamination  length    N50
## 1 KR46_June_concoct_bin.11        94.21          0.52 2099804  13050
## 2 KR46_June_concoct_bin.19        65.48         10.94 1296063   1710
## 3  KR46_June_concoct_bin.2        88.45          1.91 2226672   7934
## 4 KR46_June_concoct_bin.26        77.23          3.49 2863584   3661
## 5 KR46_June_concoct_bin.39        77.09          3.45 1498849   2611
## 6 KR46_June_concoct_bin.43       100.00          0.28 5264925 161854

Relative abundance data from CoverM

# Read in the CoverM output

CoverM <- 
  read.csv("../../../06_Representative_MAGs/coverm/coverm_drep_mag_rel_abun.tsv", 
           header = T, sep = "\t")

head(CoverM)
##                     Genome KR46_June.sorted.Relative.Abundance....
## 1                 unmapped                               0.1982570
## 2 KR46_June_concoct_bin.39                               1.8877897
## 3   KR46_May_metabat.bin.4                               6.2776270
## 4 KR46_June_concoct_bin.11                               1.5452732
## 5   KR46_May_metabat.bin.5                              38.8135260
## 6 KR46_June_concoct_bin.26                               0.3230628
##   KR46_May.sorted.Relative.Abundance....
## 1                              0.7948101
## 2                              0.2796628
## 3                             10.2860890
## 4                              0.9776185
## 5                             50.1354180
## 6                              0.1859733
##   KR46_Sept.sorted.Relative.Abundance....
## 1                               2.6669740
## 2                               0.8485058
## 3                               3.2700922
## 4                               4.1820410
## 5                              31.9509660
## 6                               0.3885868
## Let's rename the column headers in CoverM - you should use your own sample names.

colnames(CoverM)[1]="genome"
colnames(CoverM)[2]="KR46_June"
colnames(CoverM)[3]="KR46_May"
colnames(CoverM)[4]="KR46_Sept"

head(CoverM)
##                     genome  KR46_June   KR46_May  KR46_Sept
## 1                 unmapped  0.1982570  0.7948101  2.6669740
## 2 KR46_June_concoct_bin.39  1.8877897  0.2796628  0.8485058
## 3   KR46_May_metabat.bin.4  6.2776270 10.2860890  3.2700922
## 4 KR46_June_concoct_bin.11  1.5452732  0.9776185  4.1820410
## 5   KR46_May_metabat.bin.5 38.8135260 50.1354180 31.9509660
## 6 KR46_June_concoct_bin.26  0.3230628  0.1859733  0.3885868

Summarise data

Now we’ll join some of our data tables together to summarise our results.

# Create a summary from the data tables "Cdb_tab", "BinQ_tab" and "gtdb_tax".
# The tables are being joined by the genome name in the column "genome".

MAG_summary <-
  as_tibble(left_join(Cdb_tab, BinQ_tab, "genome") %>%
  left_join(., gtdb_tax, "genome"))

head(MAG_summary)
## # A tibble: 6 × 14
##   genome               secondary_cluster completeness contamination length   N50
##   <chr>                <chr>                    <dbl>         <dbl>  <int> <int>
## 1 KR46_June_concoct_b… 1_1                       94.2          0.52 2.10e6 13050
## 2 KR46_May_concoct_bi… 1_1                       75.7          1.25 1.70e6  2736
## 3 KR46_Sept_concoct_b… 1_1                       92.5          0.2  2.08e6  8728
## 4 KR46_June_concoct_b… 2_1                       88.4          1.91 2.23e6  7934
## 5 KR46_May_concoct_bi… 2_1                       88.8          1.72 2.03e6 14518
## 6 KR46_Sept_concoct_b… 2_1                      100.           2.68 2.34e6 50449
## # ℹ 8 more variables: classification <chr>, domain <chr>, phylum <chr>,
## #   class <chr>, order <chr>, family <chr>, genus <chr>, species <chr>
# Create a summary from the data tables "Cdb_tab", "BinQ_tab", "gtdb_tax" and "CoverM.
# The tables are being joined by the genome name in the column "genome".

MAG_rep_summary <-
  as_tibble(left_join(Rep_bins, BinQ_tab, "genome") %>%
  left_join(., gtdb_tax, "genome") %>%
  left_join(., CoverM, "genome"))

head(MAG_rep_summary)
## # A tibble: 6 × 18
##   genome   cluster score completeness contamination length    N50 classification
##   <chr>    <chr>   <dbl>        <dbl>         <dbl>  <int>  <int> <chr>         
## 1 KR46_Ju… 1_1      93.7         94.2          0.52 2.10e6  13050 d__Archaea;p_…
## 2 KR46_Se… 2_1      89.0        100.           2.68 2.34e6  50449 d__Bacteria;p…
## 3 KR46_Ma… 3_1     103.         100.           0.01 3.41e6 133245 d__Bacteria;p…
## 4 KR46_Ju… 4_0      63.6         79.2          3.28 2.39e6   3224 d__Bacteria;p…
## 5 KR46_Ju… 5_1     101.         100            0.4  3.74e6 174145 d__Bacteria;p…
## 6 KR46_Ma… 6_1      88.3         98.5          2.56 4.64e6 161636 d__Bacteria;p…
## # ℹ 10 more variables: domain <chr>, phylum <chr>, class <chr>, order <chr>,
## #   family <chr>, genus <chr>, species <chr>, KR46_June <dbl>, KR46_May <dbl>,
## #   KR46_Sept <dbl>

Now we’ve created our data tables, let’s move on to making some simple plots.

Data tables:

If you want to save these tables (e.g., for use in Excel) you can save them with write_delim(MAG_rep_summary, "MAG_rep_summary.tsv") and a tab separated file will be saved to your working directory.

Plotting relative abundance data

Let’s start with the relative abundance data from CoverM. To create plots in ggplot2, it’s easier to have the data in “long format”. We’ll look at this below.

# Remind ourselves what data we have in the table "MAG_rep_summary"

colnames(MAG_rep_summary)
##  [1] "genome"         "cluster"        "score"          "completeness"  
##  [5] "contamination"  "length"         "N50"            "classification"
##  [9] "domain"         "phylum"         "class"          "order"         
## [13] "family"         "genus"          "species"        "KR46_June"     
## [17] "KR46_May"       "KR46_Sept"
# To create a "long format" table from a "wide format" table, we can use the command `pivot_longer`
# My CoverM abundance data is in the columns KR46_May, KR46_June and KR46_Sept, i.e., columns 16:18 above.
# In the "pivot_longer" command below, I've selected those columns. If you have more samples, you will need to adjust the column numbers (16:18) to select all of your sample data.

MAG_rep_summary_l <-
  MAG_rep_summary %>%
  pivot_longer(16:18, names_to = "sample", values_to = "abundance")

# What is "long format" and "wide format". It's easier to visualise:
# Example of wide format - the abundance data for each sample is in it's own column.

MAG_rep_summary %>% 
  select(genome, KR46_May, KR46_June, KR46_Sept)
## # A tibble: 12 × 4
##    genome                   KR46_May KR46_June KR46_Sept
##    <chr>                       <dbl>     <dbl>     <dbl>
##  1 KR46_June_concoct_bin.11    0.978    1.55       4.18 
##  2 KR46_Sept_concoct_bin.14    3.61     3.31      10.3  
##  3 KR46_May_metabat.bin.5     50.1     38.8       32.0  
##  4 KR46_June_concoct_bin.7     0.210    0.280      0.216
##  5 KR46_June_metabat.bin.10   22.5     24.2       19.2  
##  6 KR46_May_metabat.bin.4     10.3      6.28       3.27 
##  7 KR46_Sept_concoct_bin.23    0.340    0.0784     2.97 
##  8 KR46_June_concoct_bin.39    0.280    1.89       0.849
##  9 KR46_May_concoct_bin.3      8.63    16.1       14.3  
## 10 KR46_June_concoct_bin.26    0.186    0.323      0.389
## 11 KR46_June_metabat.bin.21    0.636    6.81       0.325
## 12 KR46_May_concoct_bin.11     1.45     0.194      9.41
# Example of long format - the abundance data for every sample is in the column "abundance". We therefore also have an additional column called "sample".

MAG_rep_summary_l %>% 
  select(genome, sample, abundance)
## # A tibble: 36 × 3
##    genome                   sample    abundance
##    <chr>                    <chr>         <dbl>
##  1 KR46_June_concoct_bin.11 KR46_June     1.55 
##  2 KR46_June_concoct_bin.11 KR46_May      0.978
##  3 KR46_June_concoct_bin.11 KR46_Sept     4.18 
##  4 KR46_Sept_concoct_bin.14 KR46_June     3.31 
##  5 KR46_Sept_concoct_bin.14 KR46_May      3.61 
##  6 KR46_Sept_concoct_bin.14 KR46_Sept    10.3  
##  7 KR46_May_metabat.bin.5   KR46_June    38.8  
##  8 KR46_May_metabat.bin.5   KR46_May     50.1  
##  9 KR46_May_metabat.bin.5   KR46_Sept    32.0  
## 10 KR46_June_concoct_bin.7  KR46_June     0.280
## # ℹ 26 more rows

Stacked bar plot

One simple way to view abundance data is with a stacked bar plot. We can do this with ggplot2, which is a package included in tidyverse.

# Using the data table "MAG_rep_summary_l", we want "sample" on the x axis, "abundance" on the y axis, and the bars to be filled with colour by "genus".

stack_plot1 <-
  ggplot(MAG_rep_summary_l, aes(x = sample, y = abundance, fill = genus)) +
    geom_bar(stat = "identity")

stack_plot1

This looks great in this example where we only have 12 genera, but if you have many more samples or groupings it can become hard to read. A bubble plot is another way to present abundance data and it works better with more samples.

Bubble plot

# Let's use the same data as above, but plot with "geom_point" instead of "geom_bar"

bubble_plot1 <-
  MAG_rep_summary_l %>%
    ggplot(aes(x = sample, y = genome, size = abundance, fill = genus)) +
    geom_point(shape = 21, alpha = 1)

bubble_plot1

This looks ok but there are a few things we could do to improve it and make it easier to read.

# Plot samples in the correct order: by date in this example

MAG_rep_summary_l$sample <- 
  factor(MAG_rep_summary_l$sample, 
         levels = c("KR46_May", "KR46_June", "KR46_Sept"))

# Use "facet_grid" to group samples by phylum

MAG_rep_summary_l %>%
  ggplot(aes(x = sample, y = genome, size = abundance, fill = phylum)) +
  geom_point(shape = 21, alpha = 1) +
  scale_size_area(max_size = 10) +
  facet_grid(rows = vars(phylum), scales = "free", space = "free") +
  theme(strip.text.y = element_text(colour = NA),
        strip.background.y = element_blank())

# Alternatively, reorder by the variable phylum, so like colours are grouped together

bubble_plot2 <-
  MAG_rep_summary_l %>%
    mutate(genome = fct_reorder(genome, desc(phylum))) %>%
    ggplot(aes(x = sample, y = genome, size = abundance, fill = phylum)) +
    geom_point(shape = 21, alpha = 1) +
    scale_size_area(max_size = 10)

bubble_plot2

# You can also use "tab" to look through the different "theme" options. Let's try "theme_linedraw" and increase the maximum size of our bubbles with the option "scale_size_area".

bubble_plot3 <- 
  MAG_rep_summary_l %>%
    mutate(genome = fct_reorder(genome, desc(phylum))) %>%
    ggplot(aes(x = sample, y = genome, size = abundance, fill = phylum)) +
    geom_point(shape = 21, alpha = 1) +
    scale_size_area(max_size = 15) +
    theme_linedraw()

bubble_plot3

There are many other ways you can change the look of your plots. If you google “ggplot2” you’ll find endless tutorials and tips. Next let’s take a look at some of our functional annotations from METABOLIC.

Presence/absence heatmap

We’ll start by getting our METABOLIC data into a more friendly format.

# We're selecting the columns "Category", "Gene.abbreviation", and "SAMPLE.Hmm.presence"
# Converting to long format with "pivot_longer"
# Then removing ".Hmm.presence" from the end of our sample names. You'll need to do this for each of your samples.

metab_May <-
  metabolism_KR46_May %>% 
  select(Category, Gene.abbreviation, ends_with("presence")) %>% 
  pivot_longer(!c(1:2), names_to = "genome", values_to = "presence") %>% 
  mutate(genome = str_remove(genome, ".Hmm.presence"))

metab_June <-
  metabolism_KR46_June %>% 
  select(Category, Gene.abbreviation, ends_with("presence")) %>% 
  pivot_longer(!c(1:2), names_to = "genome", values_to = "presence") %>% 
  mutate(genome = str_remove(genome, ".Hmm.presence"))

metab_Sept <-
  metabolism_KR46_Sept %>% 
  select(Category, Gene.abbreviation, ends_with("presence")) %>% 
  pivot_longer(!c(1:2), names_to = "genome", values_to = "presence") %>% 
  mutate(genome = str_remove(genome, ".Hmm.presence"))


# Now we can join each of the sample files together.
# If you have more than 3, list all of your samples names in the brackets after rbind). 
# We're also replacing the word "Present" with the number "1" and the word "Absent" with the number "0".

Metabolism_summary <- 
  rbind(metab_May, metab_June, metab_Sept) %>%
  mutate(presence = case_when(presence == "Present" ~ 1,
                              presence == "Absent" ~ 0))


# Convert "presence" column from continuous (0 to 1) to discrete (0 and 1).

Metabolism_summary$presence <- as.factor(Metabolism_summary$presence)


# "Metabolism_summary" contains the annotation for all MAGs from each sample, so we'll also create a summary of just the representative MAGs by filtering based on the "genome" column in the table "Rep_bins" which holds our representative bins from dRep.

Metabolism_rep_summary <-
  Metabolism_summary %>%
  filter(genome %in% Rep_bins$genome)

head(Metabolism_rep_summary)
## # A tibble: 6 × 4
##   Category               Gene.abbreviation                       genome presence
##   <chr>                  <chr>                                   <chr>  <fct>   
## 1 Thermophilic specific  rgy                                     KR46_… 0       
## 2 Thermophilic specific  rgy                                     KR46_… 0       
## 3 Thermophilic specific  rgy                                     KR46_… 0       
## 4 Thermophilic specific  rgy                                     KR46_… 0       
## 5 Amino acid utilization 4-aminobutyrate aminotransferase and r… KR46_… 0       
## 6 Amino acid utilization 4-aminobutyrate aminotransferase and r… KR46_… 1

Now we have our METABOLIC gene annotations summarised in:

# Let's see what metabolism categories we have to choose from in our METABOLIC output
Metabolism_summary %>% 
  select(Category) %>% 
  distinct() %>% 
  print(n = 30)
## # A tibble: 25 × 1
##    Category                                                
##    <chr>                                                   
##  1 Thermophilic specific                                   
##  2 Amino acid utilization                                  
##  3 Ethanol fermentation                                    
##  4 Fatty acid degradation                                  
##  5 Aromatics degradation                                   
##  6 Complex carbon degradation                              
##  7 Fermentation                                            
##  8 C1 metabolism                                           
##  9 Methane metabolism                                      
## 10 Carbon fixation                                         
## 11 Nitrogen cycling                                        
## 12 Sulfur cycling                                          
## 13 Hydrogenases                                            
## 14 Oxidative phosphorylation                               
## 15 Oxygen metabolism (Oxidative phosphorylation Complex IV)
## 16 Urea utilization                                        
## 17 Halogenated compound utilization                        
## 18 Perchlorate reduction                                   
## 19 Chlorite reduction                                      
## 20 As cycling                                              
## 21 Selenate reduction                                      
## 22 Nitrile hydration                                       
## 23 Iron cycling                                            
## 24 Manganese cycling                                       
## 25 Sulfur cycling enzymes (detailed)
# In this example, I'm going to choose the category "Sulfur cycling" to look at in more detail.
sulfur_cycle <-
  Metabolism_rep_summary %>%
  filter(Category == "Sulfur cycling")

# And create a simple heatmap, or presence/absence plot. Again we're using ggplot, but this time we select "geom_tile"
gene_plot1 <-
  sulfur_cycle %>% 
      ggplot(aes(x = Gene.abbreviation, y = genome, fill = presence)) + 
      geom_tile(colour = "black") +
      scale_fill_manual(values = c("white", "blue")) +
      scale_x_discrete(expand=c(0, 0)) +
      scale_y_discrete(expand=c(0, 0)) +
      coord_fixed() +
      theme(axis.text.x = element_text(angle = 90))

gene_plot1

This plot shows us genes for “Sulfur cyling” on the x-axis, but it could also be useful to use the summary of different functions provided on the second worksheet from METABOLIC.

# Look at the columns headers for worksheet2 from METABOLIC
colnames(function_KR46_May)
##  [1] "Category"                                 
##  [2] "Function"                                 
##  [3] "Gene.abbreviation"                        
##  [4] "KR46_May_concoct_bin.1.Function.presence" 
##  [5] "KR46_May_concoct_bin.11.Function.presence"
##  [6] "KR46_May_concoct_bin.3.Function.presence" 
##  [7] "KR46_May_concoct_bin.32.Function.presence"
##  [8] "KR46_May_metabat.bin.12.Function.presence"
##  [9] "KR46_May_metabat.bin.4.Function.presence" 
## [10] "KR46_May_metabat.bin.5.Function.presence"
# Put it into a more friendly format: again we're converting to long format, and removing the string ".Function.presence" from the end of our sample names.

func_May <-
   function_KR46_May %>%
  pivot_longer(!c(1:3), names_to = "genome", values_to = "presence") %>%
  mutate(genome = str_remove(genome, ".Function.presence"))

func_June <-
  function_KR46_June %>%
  pivot_longer(!c(1:3), names_to = "genome", values_to = "presence") %>%
  mutate(genome = str_remove(genome, ".Function.presence"))

func_Sept <-
  function_KR46_Sept %>%
  pivot_longer(!c(1:3), names_to = "genome", values_to = "presence") %>%
  mutate(genome = str_remove(genome, ".Function.presence"))

# Join all of our samples together into one summary. We'll also replace the words "Present" and "Absent" with the numbers "1" and "0" again.

Function_summary <-
  rbind(func_May, func_June, func_Sept) %>%
  mutate(presence = case_when(presence == "Present" ~ 1,
                              presence == "Absent" ~ 0))

# Convert "presence" column from continuous (0 to 1) to discrete (0 and 1).

Function_summary$presence <- as.factor(Function_summary$presence)

# "Function_summary" contains the function annotation for all MAGs from each sample, so let's also create a summary of just the representatives by filtering based on the "genome" column in the table "Rep_bins" which holds our representative bins from dRep.

Function_rep_summary <-
  Function_summary %>%
  filter(genome %in% Rep_bins$genome)

Now we have our METABOLIC function annotations summarised in:

# Plot everything

func_plot1 <-
  Function_rep_summary %>% 
    ggplot(aes(x = genome, y = Function, fill = presence)) + 
      geom_tile(colour = "black") +
      scale_fill_manual(values = c("white","purple")) +
      coord_fixed() +
      scale_x_discrete(expand=c(0, 0)) +
      scale_y_discrete(expand=c(0, 0)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

func_plot1

That’s a bit hard to read. Let’s select some metabolisms we’re interested in based on the “Categories” provided by METABOLIC.

# Select metabolic categories and plot again. This time were using "facet_grid" to group the columns by the categories we've chosen.

func_plot2 <-
  Function_rep_summary %>%
      filter(Category == "Sulfur cycling" | Category == "Methane metabolism" | Category == "Carbon fixation") %>%
      ggplot(aes(x = Function, y = genome, fill = presence)) + 
      geom_tile(colour = "black") +
      scale_fill_manual(values = c("white", "purple")) +
      facet_grid(cols = vars(Category), scales = "free", space = "free") +
      scale_x_discrete(expand=c(0, 0)) +
      scale_y_discrete(expand=c(0, 0)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

func_plot2

It would be nice to also include some taxonomy information. We can get that information from our MAG_rep_summary table.

# Join "Function_rep_summary" with "MAG_rep_summary" by the common column "genome".
genome_tab <-
  left_join(Function_rep_summary, MAG_rep_summary, by = "genome")

# Now we can create the same plot using the joined table we just created called "genome_tab". We'll use "facet_grid" to group the rows by the variable "phylum".

func_plot3 <-
  genome_tab %>%
      filter(Category == "Sulfur cycling" | Category == "Methane metabolism" | Category == "Carbon fixation") %>%
      ggplot(aes(x = Function, y = genome, fill = presence)) + 
      geom_tile(colour = "black") +
      scale_fill_manual(values = c("white", "purple")) +
      facet_grid(cols = vars(Category), rows = vars(phylum), scales = "free", space = "free") +
      scale_x_discrete(expand=c(0, 0)) +
      scale_y_discrete(expand=c(0, 0)) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
            strip.text.y = element_text(angle = 0))

func_plot3

You can play around with the ggplot settings to improve the appearance of your plots and/or beautify your final plots for publication in a program like InkScape.

To save any plot you generate, you can use the command ggsave(filename = "YOUR_PLOT_NAME.svg", plot = YOUR_PLOT). You can also change the size of the file you save by adding width = and length=, or save in different formats by changing the extension of your file name (e.g., .png or .pdf).

The package ComplexHeatmap can also hold many more different variables. Let’s have a go.

ComplexHeatmap

ComplexHeatmap likes the data to be in a matrix, so let’s start by creating a matrix with our CoverM abundance data.

# Remind ourselves what is in the data table holding our CoverM abundance data
colnames(MAG_rep_summary)
##  [1] "genome"         "cluster"        "score"          "completeness"  
##  [5] "contamination"  "length"         "N50"            "classification"
##  [9] "domain"         "phylum"         "class"          "order"         
## [13] "family"         "genus"          "species"        "KR46_June"     
## [17] "KR46_May"       "KR46_Sept"
# My CoverM abundance data is in columns "KR46_June", "KR46_May", "KR46_Sept" i.e., column numbers 16, 17, 18. 
# Now we'll create a matrix using those columns
# I've selected them in the order I want them to appear (17 = May, 16 = June, 18 = July)
mat1 <- 
  as.matrix(MAG_rep_summary %>% 
              select(17, 16, 18))

head(mat1)
##        KR46_May  KR46_June  KR46_Sept
## [1,]  0.9776185  1.5452732  4.1820410
## [2,]  3.6091943  3.3117273 10.3326100
## [3,] 50.1354180 38.8135260 31.9509660
## [4,]  0.2099092  0.2796721  0.2164297
## [5,] 22.4642030 24.1804350 19.1765670
## [6,] 10.2860890  6.2776270  3.2700922
# Now we'll create column annotations
# In our matrix above (mat1) the column headers are "KR46_May", "KR46_June", "KR46_Sept".
# I'm going to create a "column annotation" for those samples, here I'll use months, but it could be any metadata variable.
# The annotations must be in the same order as your matrix

col_anno1 <- data.frame(month = c("May", "June", "Sept"))

# We can also define which colours we'd like to use for a variable
ann_colors = list(
  month = c(May = "#D3F6DB", June = "#92D5E6", Sept = "#772D8B"))

# The rows in our matrix don't currently have a row name. 
# This will add the names of our genomes to the row name.
rownames(mat1) = paste(MAG_rep_summary$genome)

head(mat1)
##                            KR46_May  KR46_June  KR46_Sept
## KR46_June_concoct_bin.11  0.9776185  1.5452732  4.1820410
## KR46_Sept_concoct_bin.14  3.6091943  3.3117273 10.3326100
## KR46_May_metabat.bin.5   50.1354180 38.8135260 31.9509660
## KR46_June_concoct_bin.7   0.2099092  0.2796721  0.2164297
## KR46_June_metabat.bin.10 22.4642030 24.1804350 19.1765670
## KR46_May_metabat.bin.4   10.2860890  6.2776270  3.2700922
# Now we'll create row annotations
# Let's use the "phylum" column from "MAG_rep_summary".
row_anno1 <- data.frame(phylum = MAG_rep_summary$phylum)

# And finally, draw a heatmap using our matrix "mat1" and the column and row annotations we created.
heat1 <- pheatmap(mat1,
                  cluster_rows = F, cluster_cols = F,
                  heatmap_legend_param = list(title = "Relative abundance (%)"),
                  annotation_row = row_anno1,
                  annotation_col = col_anno1,
                  annotation_colors = ann_colors)

heat1

You can create as many row and column annotations as you want. ComplexHeatmap can also integrate bar plots. Let’s create some bar plots using the “Completeness” and “Contamination” value for our MAGs found in the table MAG_rep_summary.

# create bar plot of completeness
comp <- rowAnnotation(completeness = anno_barplot(MAG_rep_summary$completeness, gp = gpar(fill = "pink")),
                      annotation_name_side = c("bottom"), 
                      annotation_name_rot = c(90))

# create bar plot of contamination
con <- rowAnnotation(contamination = anno_barplot(MAG_rep_summary$contamination, gp = gpar(fill = "yellow")),
                      annotation_name_side = c("bottom"), 
                      annotation_name_rot = c(90))

# Draw these barplots with the heatmap we created above
draw(comp + con + heat1)

Looks good? Great! The purpose of this lesson was to give you some ideas of how to present your data. As you can see there are many ways to do it and lots of different settings. I hope this was a useful starting off point.

Next we’ll generate a phylogenomic tree using GToTree. We’ll do this on SCITAS following the instructions on GitHub, but you may find it useful to look at some of the data tables we’ve generated to choose which MAGs to include in your phylogenomic tree so let’s save our GTDB taxonomy table:

# Save "gtdb_tax" as a tsv file and open it in Excel
write_delim(gtdb_tax, "gtdb_taxonomy.tsv", delim = "\t")