We’re going to use the R packages Tidyverse and ComplexHeatmap to look at the results of our metagenomic workflow and make some figures. The goal is to understand the data we’ve produced whilst thinking about some ways to present it.
Tidyverse is a collection of R packages that are very useful for data analysis. If you’re not already familiar with it, I highly recommend it but there are many different ways to work with data in R and you can choose which ever way works for you.
We’ll start by loading the R packages.
library(tidyverse)
library(ComplexHeatmap)
Then we’ll open some of our data:
In 05_Annotation/gtdb you should have a
directory for each of your samples. Each sample directory will
contain:
SAMPLE.bac120.summary.tsv : bacteria
taxonomic classificationsSAMPLE.ar53.summary.tsv : archaea
taxonomic classificationsOur functional annotation from METABOLIC is in
05_Annotation/metabolic/SAMPLE/. METABOLIC
provides all of the data as a single Excel file, but if you want to load
the data in R you can find each sheet of the excel file in
METABOLIC_result_each_spreadsheet/. We’ll
use the files:
METABOLIC_result_worksheet1.tsv : gene
annotationsMETABOLIC_result_worksheet2.tsv :
function annotationsThe dRep output is in
06_Representative_MAGs/drep/out/data_tables/
We’ll use the files:
Wdb.csv : a list of your
representative genomes, the cluster of genomes they represent, and their
score.Cdb.csv : your complete list of
genomes and the cluster the were assigned to.genomeInfo.csv : completeness,
contamination, length and N50 of each MAGThe CoverM output is in
06_Representative_MAGs/coverm/
coverm_drep_mag_rel_abun.tsv :
relative abundance data for representative MAGs# Read in the GTDB output files (read.csv) for bacteria and archaea for each sample. Header = (TRUE or FALSE), separator = tab.
# You'll need to change the sample names to match your dataset.
gtdb_bac_KR46_May <-
read.csv("../../../05_Annotation/gtdb/KR46_May/KR46_May.bac120.summary.tsv",
header = TRUE, sep = "\t")
gtdb_arc_KR46_May <-
read.csv("../../../05_Annotation/gtdb/KR46_May/KR46_May.ar53.summary.tsv",
header = TRUE, sep = "\t")
gtdb_bac_KR46_June <-
read.csv("../../../05_Annotation/gtdb/KR46_June/KR46_June.bac120.summary.tsv",
header = TRUE, sep = "\t")
gtdb_arc_KR46_June <-
read.csv("../../../05_Annotation/gtdb/KR46_June/KR46_June.ar53.summary.tsv",
header = TRUE, sep = "\t")
gtdb_bac_KR46_Sept <-
read.csv("../../../05_Annotation/gtdb/KR46_Sept/KR46_Sept.bac120.summary.tsv",
header = TRUE, sep = "\t")
gtdb_arc_KR46_Sept <-
read.csv("../../../05_Annotation/gtdb/KR46_Sept/KR46_Sept.ar53.summary.tsv",
header = TRUE, sep = "\t")
# Join all of our taxonomic information into one table called "gtdb_tab". (rbind = row binding).
gtdb_tab <-
rbind(gtdb_arc_KR46_June,
gtdb_arc_KR46_May,
gtdb_arc_KR46_Sept,
gtdb_bac_KR46_June,
gtdb_bac_KR46_May,
gtdb_bac_KR46_Sept)
# Select the columns "genome" and "classification" in "gtdb_tab".
gtdb_tab %>%
select(user_genome, classification) %>%
as_tibble %>%
head()
## # A tibble: 6 × 2
## user_genome classification
## <chr> <chr>
## 1 KR46_June_concoct_bin.11 d__Archaea;p__Methanobacteriota;c__Methanobacteria;o…
## 2 KR46_May_concoct_bin.32 d__Archaea;p__Methanobacteriota;c__Methanobacteria;o…
## 3 KR46_Sept_concoct_bin.6 d__Archaea;p__Methanobacteriota;c__Methanobacteria;o…
## 4 KR46_June_concoct_bin.19 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Peptostr…
## 5 KR46_June_concoct_bin.2 d__Bacteria;p__Bacillota_B;c__Desulfotomaculia;o__Am…
## 6 KR46_June_concoct_bin.26 d__Bacteria;p__Chloroflexota;c__Anaerolineae;o__Anae…
You’ll see that the taxonomy is one long string of text, so we’ll create a table where the taxonomy is separated into columns by rank.
# Create taxonomy table and rename the column "user_genome" to "genome".
gtdb_tax <-
gtdb_tab %>%
select(user_genome, classification) %>%
rename(genome = user_genome) %>%
separate(classification,
sep = ";",
into = c("domain", "phylum", "class", "order", "family", "genus", "species"),
remove = F) %>%
as_tibble()
head(gtdb_tax)
## # A tibble: 6 × 9
## genome classification domain phylum class order family genus species
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 KR46_June_conco… d__Archaea;p_… d__Ar… p__Me… c__M… o__M… f__Me… g__U… s__UBA…
## 2 KR46_May_concoc… d__Archaea;p_… d__Ar… p__Me… c__M… o__M… f__Me… g__U… s__UBA…
## 3 KR46_Sept_conco… d__Archaea;p_… d__Ar… p__Me… c__M… o__M… f__Me… g__U… s__UBA…
## 4 KR46_June_conco… d__Bacteria;p… d__Ba… p__Ba… c__C… o__P… f__An… g__P… s__
## 5 KR46_June_conco… d__Bacteria;p… d__Ba… p__Ba… c__D… o__A… f__De… g__S… s__SUR…
## 6 KR46_June_conco… d__Bacteria;p… d__Ba… p__Ch… c__A… o__A… f__An… g__U… s__
Now we have separate columns for our GTDB taxonomic information in
gtdb_tax.
# Read in the METABOLIC gene annotations.
# Don't forget to change the sample names to match your dataset.
metabolism_KR46_May <-
read.csv("../../../05_Annotation/metabolic/KR46_May/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet1.tsv",
header = T, sep = "\t")
metabolism_KR46_June <-
read.csv("../../../05_Annotation/metabolic/KR46_June/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet1.tsv",
header = T, sep = "\t")
metabolism_KR46_Sept <-
read.csv("../../../05_Annotation/metabolic/KR46_Sept/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet1.tsv",
header = T, sep = "\t")
# Read in the METABOLIC function annotations.
function_KR46_May <-
read.csv("../../../05_Annotation/metabolic/KR46_May/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet2.tsv",
header = T, sep = "\t")
function_KR46_June <-
read.csv("../../../05_Annotation/metabolic/KR46_June/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet2.tsv",
header = T, sep = "\t")
function_KR46_Sept <-
read.csv("../../../05_Annotation/metabolic/KR46_Sept/METABOLIC_result_each_spreadsheet/METABOLIC_result_worksheet2.tsv",
header = T, sep = "\t")
# Read in the dRep output files (read.csv). Notice this time the separator is a comma.
Wdb <-
read.csv("../../../06_Representative_MAGs/drep/out/data_tables/Wdb.csv",
header = TRUE, sep = ",")
Cdb <-
read.csv("../../../06_Representative_MAGs/drep/out/data_tables/Cdb.csv",
header = TRUE, sep = ",")
genomeInfo <-
read.csv("../../../06_Representative_MAGs/drep/out/data_tables/genomeInfo.csv",
header = TRUE, sep = ",")
# In the column "genome" below, we still have the ".fa" file extension.
head(Wdb)
## genome cluster score
## 1 KR46_June_concoct_bin.11.fa 1_1 93.67456
## 2 KR46_Sept_concoct_bin.14.fa 2_1 88.95760
## 3 KR46_May_metabat.bin.5.fa 3_1 102.52197
## 4 KR46_June_concoct_bin.7.fa 4_0 63.56420
## 5 KR46_June_metabat.bin.10.fa 5_1 100.64033
## 6 KR46_May_metabat.bin.4.fa 6_1 88.32306
# Let's remove ".fa" to make it easier search and join with other data tables later.
# Representive MAGs
Rep_bins <-
Wdb %>%
mutate(genome = str_remove(genome, ".fa"))
head(Rep_bins)
## genome cluster score
## 1 KR46_June_concoct_bin.11 1_1 93.67456
## 2 KR46_Sept_concoct_bin.14 2_1 88.95760
## 3 KR46_May_metabat.bin.5 3_1 102.52197
## 4 KR46_June_concoct_bin.7 4_0 63.56420
## 5 KR46_June_metabat.bin.10 5_1 100.64033
## 6 KR46_May_metabat.bin.4 6_1 88.32306
# Cluster of all MAGs
Cdb_tab <-
Cdb %>%
mutate(genome = str_remove(genome,".fa")) %>%
select(genome, secondary_cluster)
head(Cdb_tab)
## genome secondary_cluster
## 1 KR46_June_concoct_bin.11 1_1
## 2 KR46_May_concoct_bin.32 1_1
## 3 KR46_Sept_concoct_bin.6 1_1
## 4 KR46_June_concoct_bin.2 2_1
## 5 KR46_May_concoct_bin.1 2_1
## 6 KR46_Sept_concoct_bin.14 2_1
# MAG quality info
BinQ_tab <-
genomeInfo %>%
mutate(genome = str_remove(genome, ".fa"))
head(BinQ_tab)
## genome completeness contamination length N50
## 1 KR46_June_concoct_bin.11 94.21 0.52 2099804 13050
## 2 KR46_June_concoct_bin.19 65.48 10.94 1296063 1710
## 3 KR46_June_concoct_bin.2 88.45 1.91 2226672 7934
## 4 KR46_June_concoct_bin.26 77.23 3.49 2863584 3661
## 5 KR46_June_concoct_bin.39 77.09 3.45 1498849 2611
## 6 KR46_June_concoct_bin.43 100.00 0.28 5264925 161854
# Read in the CoverM output
CoverM <-
read.csv("../../../06_Representative_MAGs/coverm/coverm_drep_mag_rel_abun.tsv",
header = T, sep = "\t")
head(CoverM)
## Genome KR46_June.sorted.Relative.Abundance....
## 1 unmapped 0.1982570
## 2 KR46_June_concoct_bin.39 1.8877897
## 3 KR46_May_metabat.bin.4 6.2776270
## 4 KR46_June_concoct_bin.11 1.5452732
## 5 KR46_May_metabat.bin.5 38.8135260
## 6 KR46_June_concoct_bin.26 0.3230628
## KR46_May.sorted.Relative.Abundance....
## 1 0.7948101
## 2 0.2796628
## 3 10.2860890
## 4 0.9776185
## 5 50.1354180
## 6 0.1859733
## KR46_Sept.sorted.Relative.Abundance....
## 1 2.6669740
## 2 0.8485058
## 3 3.2700922
## 4 4.1820410
## 5 31.9509660
## 6 0.3885868
## Let's rename the column headers in CoverM - you should use your own sample names.
colnames(CoverM)[1]="genome"
colnames(CoverM)[2]="KR46_June"
colnames(CoverM)[3]="KR46_May"
colnames(CoverM)[4]="KR46_Sept"
head(CoverM)
## genome KR46_June KR46_May KR46_Sept
## 1 unmapped 0.1982570 0.7948101 2.6669740
## 2 KR46_June_concoct_bin.39 1.8877897 0.2796628 0.8485058
## 3 KR46_May_metabat.bin.4 6.2776270 10.2860890 3.2700922
## 4 KR46_June_concoct_bin.11 1.5452732 0.9776185 4.1820410
## 5 KR46_May_metabat.bin.5 38.8135260 50.1354180 31.9509660
## 6 KR46_June_concoct_bin.26 0.3230628 0.1859733 0.3885868
Now we’ll join some of our data tables together to summarise our results.
# Create a summary from the data tables "Cdb_tab", "BinQ_tab" and "gtdb_tax".
# The tables are being joined by the genome name in the column "genome".
MAG_summary <-
as_tibble(left_join(Cdb_tab, BinQ_tab, "genome") %>%
left_join(., gtdb_tax, "genome"))
head(MAG_summary)
## # A tibble: 6 × 14
## genome secondary_cluster completeness contamination length N50
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 KR46_June_concoct_b… 1_1 94.2 0.52 2.10e6 13050
## 2 KR46_May_concoct_bi… 1_1 75.7 1.25 1.70e6 2736
## 3 KR46_Sept_concoct_b… 1_1 92.5 0.2 2.08e6 8728
## 4 KR46_June_concoct_b… 2_1 88.4 1.91 2.23e6 7934
## 5 KR46_May_concoct_bi… 2_1 88.8 1.72 2.03e6 14518
## 6 KR46_Sept_concoct_b… 2_1 100. 2.68 2.34e6 50449
## # ℹ 8 more variables: classification <chr>, domain <chr>, phylum <chr>,
## # class <chr>, order <chr>, family <chr>, genus <chr>, species <chr>
# Create a summary from the data tables "Cdb_tab", "BinQ_tab", "gtdb_tax" and "CoverM.
# The tables are being joined by the genome name in the column "genome".
MAG_rep_summary <-
as_tibble(left_join(Rep_bins, BinQ_tab, "genome") %>%
left_join(., gtdb_tax, "genome") %>%
left_join(., CoverM, "genome"))
head(MAG_rep_summary)
## # A tibble: 6 × 18
## genome cluster score completeness contamination length N50 classification
## <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <chr>
## 1 KR46_Ju… 1_1 93.7 94.2 0.52 2.10e6 13050 d__Archaea;p_…
## 2 KR46_Se… 2_1 89.0 100. 2.68 2.34e6 50449 d__Bacteria;p…
## 3 KR46_Ma… 3_1 103. 100. 0.01 3.41e6 133245 d__Bacteria;p…
## 4 KR46_Ju… 4_0 63.6 79.2 3.28 2.39e6 3224 d__Bacteria;p…
## 5 KR46_Ju… 5_1 101. 100 0.4 3.74e6 174145 d__Bacteria;p…
## 6 KR46_Ma… 6_1 88.3 98.5 2.56 4.64e6 161636 d__Bacteria;p…
## # ℹ 10 more variables: domain <chr>, phylum <chr>, class <chr>, order <chr>,
## # family <chr>, genus <chr>, species <chr>, KR46_June <dbl>, KR46_May <dbl>,
## # KR46_Sept <dbl>
Now we’ve created our data tables, let’s move on to making some simple plots.
Data tables:
MAG_rep_summary : summary of
representative MAGsMAG_summary : summary of all MAGs
without dereplicationIf you want to save these tables (e.g., for use in Excel) you can
save them with
write_delim(MAG_rep_summary, "MAG_rep_summary.tsv") and a
tab separated file will be saved to your working directory.
Let’s start with the relative abundance data from CoverM. To create plots in ggplot2, it’s easier to have the data in “long format”. We’ll look at this below.
# Remind ourselves what data we have in the table "MAG_rep_summary"
colnames(MAG_rep_summary)
## [1] "genome" "cluster" "score" "completeness"
## [5] "contamination" "length" "N50" "classification"
## [9] "domain" "phylum" "class" "order"
## [13] "family" "genus" "species" "KR46_June"
## [17] "KR46_May" "KR46_Sept"
# To create a "long format" table from a "wide format" table, we can use the command `pivot_longer`
# My CoverM abundance data is in the columns KR46_May, KR46_June and KR46_Sept, i.e., columns 16:18 above.
# In the "pivot_longer" command below, I've selected those columns. If you have more samples, you will need to adjust the column numbers (16:18) to select all of your sample data.
MAG_rep_summary_l <-
MAG_rep_summary %>%
pivot_longer(16:18, names_to = "sample", values_to = "abundance")
# What is "long format" and "wide format". It's easier to visualise:
# Example of wide format - the abundance data for each sample is in it's own column.
MAG_rep_summary %>%
select(genome, KR46_May, KR46_June, KR46_Sept)
## # A tibble: 12 × 4
## genome KR46_May KR46_June KR46_Sept
## <chr> <dbl> <dbl> <dbl>
## 1 KR46_June_concoct_bin.11 0.978 1.55 4.18
## 2 KR46_Sept_concoct_bin.14 3.61 3.31 10.3
## 3 KR46_May_metabat.bin.5 50.1 38.8 32.0
## 4 KR46_June_concoct_bin.7 0.210 0.280 0.216
## 5 KR46_June_metabat.bin.10 22.5 24.2 19.2
## 6 KR46_May_metabat.bin.4 10.3 6.28 3.27
## 7 KR46_Sept_concoct_bin.23 0.340 0.0784 2.97
## 8 KR46_June_concoct_bin.39 0.280 1.89 0.849
## 9 KR46_May_concoct_bin.3 8.63 16.1 14.3
## 10 KR46_June_concoct_bin.26 0.186 0.323 0.389
## 11 KR46_June_metabat.bin.21 0.636 6.81 0.325
## 12 KR46_May_concoct_bin.11 1.45 0.194 9.41
# Example of long format - the abundance data for every sample is in the column "abundance". We therefore also have an additional column called "sample".
MAG_rep_summary_l %>%
select(genome, sample, abundance)
## # A tibble: 36 × 3
## genome sample abundance
## <chr> <chr> <dbl>
## 1 KR46_June_concoct_bin.11 KR46_June 1.55
## 2 KR46_June_concoct_bin.11 KR46_May 0.978
## 3 KR46_June_concoct_bin.11 KR46_Sept 4.18
## 4 KR46_Sept_concoct_bin.14 KR46_June 3.31
## 5 KR46_Sept_concoct_bin.14 KR46_May 3.61
## 6 KR46_Sept_concoct_bin.14 KR46_Sept 10.3
## 7 KR46_May_metabat.bin.5 KR46_June 38.8
## 8 KR46_May_metabat.bin.5 KR46_May 50.1
## 9 KR46_May_metabat.bin.5 KR46_Sept 32.0
## 10 KR46_June_concoct_bin.7 KR46_June 0.280
## # ℹ 26 more rows
One simple way to view abundance data is with a stacked bar plot. We can do this with ggplot2, which is a package included in tidyverse.
# Using the data table "MAG_rep_summary_l", we want "sample" on the x axis, "abundance" on the y axis, and the bars to be filled with colour by "genus".
stack_plot1 <-
ggplot(MAG_rep_summary_l, aes(x = sample, y = abundance, fill = genus)) +
geom_bar(stat = "identity")
stack_plot1
This looks great in this example where we only have 12 genera, but if you have many more samples or groupings it can become hard to read. A bubble plot is another way to present abundance data and it works better with more samples.
# Let's use the same data as above, but plot with "geom_point" instead of "geom_bar"
bubble_plot1 <-
MAG_rep_summary_l %>%
ggplot(aes(x = sample, y = genome, size = abundance, fill = genus)) +
geom_point(shape = 21, alpha = 1)
bubble_plot1
This looks ok but there are a few things we could do to improve it and make it easier to read.
# Plot samples in the correct order: by date in this example
MAG_rep_summary_l$sample <-
factor(MAG_rep_summary_l$sample,
levels = c("KR46_May", "KR46_June", "KR46_Sept"))
# Use "facet_grid" to group samples by phylum
MAG_rep_summary_l %>%
ggplot(aes(x = sample, y = genome, size = abundance, fill = phylum)) +
geom_point(shape = 21, alpha = 1) +
scale_size_area(max_size = 10) +
facet_grid(rows = vars(phylum), scales = "free", space = "free") +
theme(strip.text.y = element_text(colour = NA),
strip.background.y = element_blank())
# Alternatively, reorder by the variable phylum, so like colours are grouped together
bubble_plot2 <-
MAG_rep_summary_l %>%
mutate(genome = fct_reorder(genome, desc(phylum))) %>%
ggplot(aes(x = sample, y = genome, size = abundance, fill = phylum)) +
geom_point(shape = 21, alpha = 1) +
scale_size_area(max_size = 10)
bubble_plot2
# You can also use "tab" to look through the different "theme" options. Let's try "theme_linedraw" and increase the maximum size of our bubbles with the option "scale_size_area".
bubble_plot3 <-
MAG_rep_summary_l %>%
mutate(genome = fct_reorder(genome, desc(phylum))) %>%
ggplot(aes(x = sample, y = genome, size = abundance, fill = phylum)) +
geom_point(shape = 21, alpha = 1) +
scale_size_area(max_size = 15) +
theme_linedraw()
bubble_plot3
There are many other ways you can change the look of your plots. If you google “ggplot2” you’ll find endless tutorials and tips. Next let’s take a look at some of our functional annotations from METABOLIC.
We’ll start by getting our METABOLIC data into a more friendly format.
# We're selecting the columns "Category", "Gene.abbreviation", and "SAMPLE.Hmm.presence"
# Converting to long format with "pivot_longer"
# Then removing ".Hmm.presence" from the end of our sample names. You'll need to do this for each of your samples.
metab_May <-
metabolism_KR46_May %>%
select(Category, Gene.abbreviation, ends_with("presence")) %>%
pivot_longer(!c(1:2), names_to = "genome", values_to = "presence") %>%
mutate(genome = str_remove(genome, ".Hmm.presence"))
metab_June <-
metabolism_KR46_June %>%
select(Category, Gene.abbreviation, ends_with("presence")) %>%
pivot_longer(!c(1:2), names_to = "genome", values_to = "presence") %>%
mutate(genome = str_remove(genome, ".Hmm.presence"))
metab_Sept <-
metabolism_KR46_Sept %>%
select(Category, Gene.abbreviation, ends_with("presence")) %>%
pivot_longer(!c(1:2), names_to = "genome", values_to = "presence") %>%
mutate(genome = str_remove(genome, ".Hmm.presence"))
# Now we can join each of the sample files together.
# If you have more than 3, list all of your samples names in the brackets after rbind).
# We're also replacing the word "Present" with the number "1" and the word "Absent" with the number "0".
Metabolism_summary <-
rbind(metab_May, metab_June, metab_Sept) %>%
mutate(presence = case_when(presence == "Present" ~ 1,
presence == "Absent" ~ 0))
# Convert "presence" column from continuous (0 to 1) to discrete (0 and 1).
Metabolism_summary$presence <- as.factor(Metabolism_summary$presence)
# "Metabolism_summary" contains the annotation for all MAGs from each sample, so we'll also create a summary of just the representative MAGs by filtering based on the "genome" column in the table "Rep_bins" which holds our representative bins from dRep.
Metabolism_rep_summary <-
Metabolism_summary %>%
filter(genome %in% Rep_bins$genome)
head(Metabolism_rep_summary)
## # A tibble: 6 × 4
## Category Gene.abbreviation genome presence
## <chr> <chr> <chr> <fct>
## 1 Thermophilic specific rgy KR46_… 0
## 2 Thermophilic specific rgy KR46_… 0
## 3 Thermophilic specific rgy KR46_… 0
## 4 Thermophilic specific rgy KR46_… 0
## 5 Amino acid utilization 4-aminobutyrate aminotransferase and r… KR46_… 0
## 6 Amino acid utilization 4-aminobutyrate aminotransferase and r… KR46_… 1
Now we have our METABOLIC gene annotations summarised in:
Metabolism_summary : summary of gene
presence/absence for all MAGs from all samplesMetabolism_rep_summary : summary of
gene presence/absence for just representative genomes# Let's see what metabolism categories we have to choose from in our METABOLIC output
Metabolism_summary %>%
select(Category) %>%
distinct() %>%
print(n = 30)
## # A tibble: 25 × 1
## Category
## <chr>
## 1 Thermophilic specific
## 2 Amino acid utilization
## 3 Ethanol fermentation
## 4 Fatty acid degradation
## 5 Aromatics degradation
## 6 Complex carbon degradation
## 7 Fermentation
## 8 C1 metabolism
## 9 Methane metabolism
## 10 Carbon fixation
## 11 Nitrogen cycling
## 12 Sulfur cycling
## 13 Hydrogenases
## 14 Oxidative phosphorylation
## 15 Oxygen metabolism (Oxidative phosphorylation Complex IV)
## 16 Urea utilization
## 17 Halogenated compound utilization
## 18 Perchlorate reduction
## 19 Chlorite reduction
## 20 As cycling
## 21 Selenate reduction
## 22 Nitrile hydration
## 23 Iron cycling
## 24 Manganese cycling
## 25 Sulfur cycling enzymes (detailed)
# In this example, I'm going to choose the category "Sulfur cycling" to look at in more detail.
sulfur_cycle <-
Metabolism_rep_summary %>%
filter(Category == "Sulfur cycling")
# And create a simple heatmap, or presence/absence plot. Again we're using ggplot, but this time we select "geom_tile"
gene_plot1 <-
sulfur_cycle %>%
ggplot(aes(x = Gene.abbreviation, y = genome, fill = presence)) +
geom_tile(colour = "black") +
scale_fill_manual(values = c("white", "blue")) +
scale_x_discrete(expand=c(0, 0)) +
scale_y_discrete(expand=c(0, 0)) +
coord_fixed() +
theme(axis.text.x = element_text(angle = 90))
gene_plot1
This plot shows us genes for “Sulfur cyling” on the x-axis, but it could also be useful to use the summary of different functions provided on the second worksheet from METABOLIC.
# Look at the columns headers for worksheet2 from METABOLIC
colnames(function_KR46_May)
## [1] "Category"
## [2] "Function"
## [3] "Gene.abbreviation"
## [4] "KR46_May_concoct_bin.1.Function.presence"
## [5] "KR46_May_concoct_bin.11.Function.presence"
## [6] "KR46_May_concoct_bin.3.Function.presence"
## [7] "KR46_May_concoct_bin.32.Function.presence"
## [8] "KR46_May_metabat.bin.12.Function.presence"
## [9] "KR46_May_metabat.bin.4.Function.presence"
## [10] "KR46_May_metabat.bin.5.Function.presence"
# Put it into a more friendly format: again we're converting to long format, and removing the string ".Function.presence" from the end of our sample names.
func_May <-
function_KR46_May %>%
pivot_longer(!c(1:3), names_to = "genome", values_to = "presence") %>%
mutate(genome = str_remove(genome, ".Function.presence"))
func_June <-
function_KR46_June %>%
pivot_longer(!c(1:3), names_to = "genome", values_to = "presence") %>%
mutate(genome = str_remove(genome, ".Function.presence"))
func_Sept <-
function_KR46_Sept %>%
pivot_longer(!c(1:3), names_to = "genome", values_to = "presence") %>%
mutate(genome = str_remove(genome, ".Function.presence"))
# Join all of our samples together into one summary. We'll also replace the words "Present" and "Absent" with the numbers "1" and "0" again.
Function_summary <-
rbind(func_May, func_June, func_Sept) %>%
mutate(presence = case_when(presence == "Present" ~ 1,
presence == "Absent" ~ 0))
# Convert "presence" column from continuous (0 to 1) to discrete (0 and 1).
Function_summary$presence <- as.factor(Function_summary$presence)
# "Function_summary" contains the function annotation for all MAGs from each sample, so let's also create a summary of just the representatives by filtering based on the "genome" column in the table "Rep_bins" which holds our representative bins from dRep.
Function_rep_summary <-
Function_summary %>%
filter(genome %in% Rep_bins$genome)
Now we have our METABOLIC function annotations summarised in:
Metabolism_summary : summary of
function presence/absence for all MAGs from all samplesMetabolism_rep_summary : summary of
function presence/absence for just representative genomes# Plot everything
func_plot1 <-
Function_rep_summary %>%
ggplot(aes(x = genome, y = Function, fill = presence)) +
geom_tile(colour = "black") +
scale_fill_manual(values = c("white","purple")) +
coord_fixed() +
scale_x_discrete(expand=c(0, 0)) +
scale_y_discrete(expand=c(0, 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
func_plot1
That’s a bit hard to read. Let’s select some metabolisms we’re interested in based on the “Categories” provided by METABOLIC.
# Select metabolic categories and plot again. This time were using "facet_grid" to group the columns by the categories we've chosen.
func_plot2 <-
Function_rep_summary %>%
filter(Category == "Sulfur cycling" | Category == "Methane metabolism" | Category == "Carbon fixation") %>%
ggplot(aes(x = Function, y = genome, fill = presence)) +
geom_tile(colour = "black") +
scale_fill_manual(values = c("white", "purple")) +
facet_grid(cols = vars(Category), scales = "free", space = "free") +
scale_x_discrete(expand=c(0, 0)) +
scale_y_discrete(expand=c(0, 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
func_plot2
It would be nice to also include some taxonomy information. We can
get that information from our
MAG_rep_summary table.
# Join "Function_rep_summary" with "MAG_rep_summary" by the common column "genome".
genome_tab <-
left_join(Function_rep_summary, MAG_rep_summary, by = "genome")
# Now we can create the same plot using the joined table we just created called "genome_tab". We'll use "facet_grid" to group the rows by the variable "phylum".
func_plot3 <-
genome_tab %>%
filter(Category == "Sulfur cycling" | Category == "Methane metabolism" | Category == "Carbon fixation") %>%
ggplot(aes(x = Function, y = genome, fill = presence)) +
geom_tile(colour = "black") +
scale_fill_manual(values = c("white", "purple")) +
facet_grid(cols = vars(Category), rows = vars(phylum), scales = "free", space = "free") +
scale_x_discrete(expand=c(0, 0)) +
scale_y_discrete(expand=c(0, 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
strip.text.y = element_text(angle = 0))
func_plot3
You can play around with the ggplot settings to improve the appearance of your plots and/or beautify your final plots for publication in a program like InkScape.
To save any plot you generate, you can use the command
ggsave(filename = "YOUR_PLOT_NAME.svg", plot = YOUR_PLOT).
You can also change the size of the file you save by adding
width = and length=, or save in different
formats by changing the extension of your file name (e.g.,
.png or .pdf).
The package ComplexHeatmap can also hold many more different variables. Let’s have a go.
ComplexHeatmap likes the data to be in a matrix, so let’s start by creating a matrix with our CoverM abundance data.
# Remind ourselves what is in the data table holding our CoverM abundance data
colnames(MAG_rep_summary)
## [1] "genome" "cluster" "score" "completeness"
## [5] "contamination" "length" "N50" "classification"
## [9] "domain" "phylum" "class" "order"
## [13] "family" "genus" "species" "KR46_June"
## [17] "KR46_May" "KR46_Sept"
# My CoverM abundance data is in columns "KR46_June", "KR46_May", "KR46_Sept" i.e., column numbers 16, 17, 18.
# Now we'll create a matrix using those columns
# I've selected them in the order I want them to appear (17 = May, 16 = June, 18 = July)
mat1 <-
as.matrix(MAG_rep_summary %>%
select(17, 16, 18))
head(mat1)
## KR46_May KR46_June KR46_Sept
## [1,] 0.9776185 1.5452732 4.1820410
## [2,] 3.6091943 3.3117273 10.3326100
## [3,] 50.1354180 38.8135260 31.9509660
## [4,] 0.2099092 0.2796721 0.2164297
## [5,] 22.4642030 24.1804350 19.1765670
## [6,] 10.2860890 6.2776270 3.2700922
# Now we'll create column annotations
# In our matrix above (mat1) the column headers are "KR46_May", "KR46_June", "KR46_Sept".
# I'm going to create a "column annotation" for those samples, here I'll use months, but it could be any metadata variable.
# The annotations must be in the same order as your matrix
col_anno1 <- data.frame(month = c("May", "June", "Sept"))
# We can also define which colours we'd like to use for a variable
ann_colors = list(
month = c(May = "#D3F6DB", June = "#92D5E6", Sept = "#772D8B"))
# The rows in our matrix don't currently have a row name.
# This will add the names of our genomes to the row name.
rownames(mat1) = paste(MAG_rep_summary$genome)
head(mat1)
## KR46_May KR46_June KR46_Sept
## KR46_June_concoct_bin.11 0.9776185 1.5452732 4.1820410
## KR46_Sept_concoct_bin.14 3.6091943 3.3117273 10.3326100
## KR46_May_metabat.bin.5 50.1354180 38.8135260 31.9509660
## KR46_June_concoct_bin.7 0.2099092 0.2796721 0.2164297
## KR46_June_metabat.bin.10 22.4642030 24.1804350 19.1765670
## KR46_May_metabat.bin.4 10.2860890 6.2776270 3.2700922
# Now we'll create row annotations
# Let's use the "phylum" column from "MAG_rep_summary".
row_anno1 <- data.frame(phylum = MAG_rep_summary$phylum)
# And finally, draw a heatmap using our matrix "mat1" and the column and row annotations we created.
heat1 <- pheatmap(mat1,
cluster_rows = F, cluster_cols = F,
heatmap_legend_param = list(title = "Relative abundance (%)"),
annotation_row = row_anno1,
annotation_col = col_anno1,
annotation_colors = ann_colors)
heat1
You can create as many row and column annotations as you want.
ComplexHeatmap can also integrate bar plots. Let’s create some bar plots
using the “Completeness” and “Contamination” value for our MAGs found in
the table MAG_rep_summary.
# create bar plot of completeness
comp <- rowAnnotation(completeness = anno_barplot(MAG_rep_summary$completeness, gp = gpar(fill = "pink")),
annotation_name_side = c("bottom"),
annotation_name_rot = c(90))
# create bar plot of contamination
con <- rowAnnotation(contamination = anno_barplot(MAG_rep_summary$contamination, gp = gpar(fill = "yellow")),
annotation_name_side = c("bottom"),
annotation_name_rot = c(90))
# Draw these barplots with the heatmap we created above
draw(comp + con + heat1)
Looks good? Great! The purpose of this lesson was to give you some ideas of how to present your data. As you can see there are many ways to do it and lots of different settings. I hope this was a useful starting off point.
Next we’ll generate a phylogenomic tree using GToTree. We’ll do this on SCITAS following the instructions on GitHub, but you may find it useful to look at some of the data tables we’ve generated to choose which MAGs to include in your phylogenomic tree so let’s save our GTDB taxonomy table:
# Save "gtdb_tax" as a tsv file and open it in Excel
write_delim(gtdb_tax, "gtdb_taxonomy.tsv", delim = "\t")