ShotgunMG v1.3.2 user guide

Contents

This document is divided into the following sections:
1) Global overview
2) Workflow walkthrough
3) Smart restart mechanism
4) End results
5) Information regarding the job output folder, where the standard output and standard error of each job are written
6) Integrating more options in a given bioinformatics package

All files referred to in this document can be found in the companion docker image: https://cloud.docker.com/u/julio514/repository/docker/julio514/centos. The first step should be to install this image on your computer.

1- Getting started

Global overview

ShotgunMG is a bioinformatics pipeline geared towards processing short-read metagenomic sequencing data on High Performance Computing (HPC) environments. It is intended for bioinformaticians experienced with a Linux-based OS (i.e. the command line) and an HPC environment, including interacting with a job scheduler - in our case, SLURM or SGE. The pipeline relies on the GenPipes Workflow Management System (WMS) (https://doi.org/10.1093/gigascience/giz037). The pipeline's methodology is described in detail (among others) in the following preprint (https://www.biorxiv.org/content/10.1101/2022.04.19.488797v3).

Figure 1. Overview of ShotgunMG. 1) Reads of each library are controlled for quality. 2) Quality-controlled reads are co-assembled into one single de novo assembly. Gene coordinates are computed on each contig. Quality-controlled reads are mapped on the co-assembly to estimate contig and gene abundance. 3) Contig and gene abundances are summarized into abundance matrices where columns = samples/libraries and rows = contig or gene identifiers. 4) Genes are annotated for taxonomy and functions and compiled in one single database (5). 6) These end results can then be used for downstream analyses.

The pipeline requires the following three elements in order to be functional:
1) The pipeline itself (nrc_pipeline_public): https://bitbucket.org/jtremblay514/nrc_pipeline_public/src/master/ (or v1.3.2)
2) A set of heterogeneous Perl and R scripts (nrc_tools_public): https://bitbucket.org/jtremblay514/nrc_tools_public/src/master/ (or v1.3.2)
3) An extensive set of third-party software, listed below along with the versions used and tested with the pipeline. Installation details for each of these packages are also available (nrc_resources_public): https://bitbucket.org/jtremblay514/nrc_resources_public/src/master/ (or v1.3.2)
List of third-party software:

trimmomatic/0.39;       megahit/1.2.9;     metabat/2.12.1;
perl/5.26.0;            bbmap/38.11;       python/3.9.0;
R/4.2.1;                java/jdk1.8.0_144; blast/2.13.0+;
bedtools/2.23 and 2.29; barrnap/0.9;       hmmer/3.3.2;
maxbin/2.2.7;           prodigal/2.6.3;    CAT/5.2.3;
RTK/0.93.2;             trnascan-se/2.0.7; exonerate/2.2.0;
checkm/1.1.3;           pplacer/1.1;       quast/5.0.2;
fraggenescan/1.31;      spades/3.15.0;     infernal/1.1.3;
kofamscan/1.3.0;        parallel/20220322; pigz/2.3.4;
diamond/2.0.15

Installing all the required libraries and external packages is a time-consuming task; for the purposes of this guide, we refer the reader to the CentOS 7 image that we provide, which includes all the packages needed to properly execute ShotgunMG.

The pipeline has been written so that the user can choose among various software packages in the workflow. It supports two shotgun metagenome assemblers (MEGAHIT and metaSPAdes), two read mappers (bwa-mem and bbmap) and two metagenome binning methods (MetaBAT2 and MaxBin2).

Setting up the databases

The databases needed by ShotgunMG are not included in the Docker image because of size constraints. The following databases should be installed manually:

CAT

Go here - https://github.com/dutilh/CAT - and follow the instructions under the preconstructed databases section.

CheckM

Follow the instructions at https://github.com/Ecogenomics/CheckM.
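As a rough sketch of what this typically involves (the data URL and target directory shown here are assumptions; follow the CheckM instructions for the current location):

# download the CheckM reference data and tell CheckM where it lives
mkdir -p $INSTALL_HOME/databases/checkm && cd $INSTALL_HOME/databases/checkm
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzf checkm_data_2015_01_16.tar.gz
checkm data setRoot $INSTALL_HOME/databases/checkm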

NCBI nr

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz (do not forget to run diamond makedb --in nr -p 4 --db nr.dmnd once the file is downloaded and decompressed).
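For reference, the download and formatting might look like the following sketch (the target directory is an assumption; adjust it to match the database paths in your .ini file):

mkdir -p $INSTALL_HOME/databases/ncbi_nr && cd $INSTALL_HOME/databases/ncbi_nr
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
pigz -d nr.gz                         # or gunzip nr.gz
diamond makedb --in nr -p 4 --db nr.dmnd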

PFAM-A

Available here: http://ftp.ebi.ac.uk/pub/databases/Pfam. Use the latest version. You may have to run hmmpress on the downloaded Pfam-A HMM file.
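For example, a minimal sketch (the exact FTP path may differ between releases, and the target directory is an assumption):

mkdir -p $INSTALL_HOME/databases/pfam && cd $INSTALL_HOME/databases/pfam
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm     # builds the binary index files used by hmmscan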

Contaminants

Available here: http://jtremblay.github.io/files/contaminants.tar.gz. Contains known Illumina contaminants, other sequencing artefacts and adapter sequences from various kits (NexteraXT, TruSeq, etc.), in fasta format.

KEGG

KEGG ortholog (KO) assignment is done using kofamscan, available here: https://www.genome.jp/ftp/tools/kofam_scan/kofam_scan-1.3.0.tar.gz. In order to run kofamscan, you will need the HMM profiles of each KO - available here: https://www.genome.jp/ftp/db/kofam/profiles.tar.gz - and the KO link file - available here: https://www.genome.jp/ftp/db/kofam/ko_list.gz. Another file, needed to link each KO to its associated pathway and/or module, is available here: https://jtremblay.github.io/files/kegg_ref_pathways_modules_combined.tsv.gz. Once downloaded, uncompress these files and move them into their final location, which should be $INSTALL_HOME/databases/kofam/. Double-check their paths in the .ini file under the [kofamscan] and [parse_kofam] sections. Kofamscan generates many intermediate files, which can make it impractical to use for large metagenomes. To circumvent this issue, you can concatenate all individual kofam profiles (cat ./profiles/K* > kofam.hmm) and run hmmpress kofam.hmm. Once done, this kofam.hmm file can be used with the hmmsearch software - see the KEGG section further down. Our internal benchmarks showed that hmmsearch and kofamscan gave identical results (i.e. for both methods, each gene pointed to the same KO).
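Putting the above together, the kofam setup might look like this sketch (the directory layout is an assumption; double-check it against the [kofamscan] and [parse_kofam] sections of your .ini file):

mkdir -p $INSTALL_HOME/databases/kofam && cd $INSTALL_HOME/databases/kofam
wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz
wget https://www.genome.jp/ftp/db/kofam/ko_list.gz
wget https://jtremblay.github.io/files/kegg_ref_pathways_modules_combined.tsv.gz
tar -xzf profiles.tar.gz && gunzip ko_list.gz kegg_ref_pathways_modules_combined.tsv.gz
# optional: build a single HMM file so that hmmsearch can be used instead of kofamscan
cat ./profiles/K* > kofam.hmm
hmmpress kofam.hmm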

Setting up files needed by the pipeline

The first step in running ShotgunMG is to set up the files required by the pipeline. Fastq libraries usually come in the form of demultiplexed paired-end sequencing libraries - one library per sample. These .fastq.gz files should be stored in a directory labeled raw_reads/. The ShotgunMG.ini and readset.tsv files are needed as well (more on this below).

Once the docker image is properly installed and loaded, we can go to the following directory:

cd $INSTALL_HOME
root@centos7image:/project/microbiome_genomics$ pwd
/project/microbiome_genomics

Next, we navigate to the projects directory:

>cd projects
>ls -l
mock_community_shotgunMG

The folder mock_community_shotgunMG contains all the files necessary for the execution of the pipeline. The raw data is from mock community samples (HM-782D and HM-783D, BEI Resources, Manassas, VA, USA) prepared as shotgun metagenomic libraries (NexteraXT) and sequenced on a HiSeq 2500 system in a 2x100 bp configuration. The ShotgunMG.ini file is configured to run a trimmomatic/megahit/metabat2 workflow.

root@centos7image:/project/microbiome_genomics/projects$ cd mock_community_shotgunMG/ 
root@centos7image:/project/microbiome_genomics/projects/mock_community_shotgunMG$ ls -l
total 52
-rwxrwxr-x 1 root    root    12175 Feb  6 20:17 ShotgunMG.ini
drwxrwxr-x 1 root    root     4096 Feb  7 20:06 job_output
-rw-r--r-- 1 root    root     3632 Feb  4 20:08 mapping_file.tsv
drwxrwsr-x 1 root    root     4096 Feb  1 15:45 raw_reads
-rwxr-x--- 1 root    root     1261 Feb  6 20:41 workflow.sh

The two inputs we need in order to launch the pipeline are:
1) the raw fastq.gz reads, which should all be stored in the raw_reads/ directory. For paired-end libraries, all files should be labelled *_R1.fastq.gz and *_R2.fastq.gz. For single-end libraries, each file should end with _R1.fastq.gz. No specific nomenclature is expected as long as the file names follow these minimal specifications.

>ls -l raw_reads/
-rw-r--r-- 1 root root   9211851 May 27 20:09 Microbial_Mock_Community_Even_A_R1.fastq.gz
-rw-r--r-- 1 root root   9674881 May 27 20:09 Microbial_Mock_Community_Even_A_R2.fastq.gz
-rw-r--r-- 1 root root   7828069 May 27 20:08 Microbial_Mock_Community_Even_B_R1.fastq.gz
-rw-r--r-- 1 root root   8072334 May 27 20:08 Microbial_Mock_Community_Even_B_R2.fastq.gz
-rw-r--r-- 1 root root   9416489 May 27 20:09 Microbial_Mock_Community_Even_C_R1.fastq.gz
-rw-r--r-- 1 root root   9915486 May 27 20:09 Microbial_Mock_Community_Even_C_R2.fastq.gz
-rw-r--r-- 1 root root   5080506 May 27 20:08 Microbial_Mock_Community_Staggered_A_R1.fastq.gz
-rw-r--r-- 1 root root   5586795 May 27 20:08 Microbial_Mock_Community_Staggered_A_R2.fastq.gz
-rw-r--r-- 1 root root   5080506 May 27 20:08 Microbial_Mock_Community_Staggered_B_R1.fastq.gz
-rw-r--r-- 1 root root   5586795 May 27 20:08 Microbial_Mock_Community_Staggered_B_R2.fastq.gz
-rw-r--r-- 1 root root   5080506 May 27 20:08 Microbial_Mock_Community_Staggered_C_R1.fastq.gz
-rw-r--r-- 1 root root   5586795 May 27 20:08 Microbial_Mock_Community_Staggered_C_R2.fastq.gz

2) the readset.tsv sheet, which is generated from the raw reads. We left traces of the relevant commands in the workflow.sh file, which contains the following commands:

>cat workflow.sh
module load nrc/nrc_tools/1.3

# Preprocessing raw reads
generateReadsetSheet.pl --indir ./raw_reads > readset.tsv

# Run the pipeline. The pipeline should be run in 4 parts: 1) steps 1 to 5, 2) step 6, 3) steps 7-17 and 4) MAGs can then be generated with step 18.
# Steps 1-5
shotgunmg.py -c ./ShotgunMG.ini -s 1-5 -o . -j slurm -r readset.tsv -z ./pipeline.json  > ./commands.sh

The first command calls the generateReadsetSheet.pl script, which takes all .fastq.gz files in the raw_reads/ directory and creates a ./readset.tsv file that holds the information required to launch the pipeline.


Running the pipeline
We can run the pipeline wrapper:

>shotgunmg.py -h
Steps:
1- trim
2- remove_contam
3- assembly
4- gene_prediction
5- abundance
6- exonerate
7- kegg_annotation
8- rpsblast_cog
9- rpsblast_kog
10- hmmscan_pfam
11- diamond_blastp_nr
12- ncrna
13- taxonomic_annotation
14- beta_diversity
15- alpha_diversity
16- overrepresentation
17- finalize
18- binning

Steps 19 and onward are not officially supported.

shotgunmg.py -c ./ShotgunMG.ini -s 1-6 -o . -j slurm -z ./pipeline.json  > ./commands.sh

Here, the -s argument specifies which steps to run. Specifying -s 1-6 will run ShotgunMG's steps 1 to 6:

1- trim
2- remove_contam
3- assembly
4- gene_prediction
5- abundance
6- exonerate

A subset of steps can also be run: for instance, -s 1-6 instructs the pipeline to generate jobs for steps 1 to 6 only. Once shotgunmg.py has been executed, all the jobs generated by the pipeline are written to standard output, here redirected to the commands.sh file specified above. If the -j slurm argument is specified, all jobs are written in such a way that they can be submitted to the SLURM job scheduler, along with their job dependency requirement(s) and syntax, if any. If the -j batch argument is specified, the jobs are written without the SLURM syntax so they can be run interactively in the terminal.
The -z ./pipeline.json argument specifies the json output file in which to store the pipeline structure; it can then be used for visualization purposes, as shown here: http://jtremblay.github.io/PipelineViewer/ShotgunMG.html
The -c ./ShotgunMG.ini argument refers to the file that contains all the various job parameters for the pipeline and is described in detail below.
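To actually submit the generated jobs, the commands.sh script can then simply be executed (a usage sketch; the -r readset.tsv argument follows the form used later in this guide):

# generate the jobs for steps 1 to 6 and submit them to SLURM
shotgunmg.py -c ./ShotgunMG.ini -s 1-6 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh
bash ./commands.sh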

shotgunmg.py and its corresponding libraries: bio/shotgun_metagenomics.py and bio/microbial_ecology.py

The nrc_pipeline_public repository contains the following structure:

├── bio
│   ├── [156K]  shotgun_metagenomics.py
│   ├── [122K]  microbial_ecology.py
│   ├── [ 96K]  rrna_amplicons.py # For the AmpliconTagger pipeline, not used in shotgunmg.py
│   ├── [1.0K]  sample.py
│   └── [ 671]  readset.py
├── core
│   ├── [5.2K]  config.py
│   ├── [7.6K]  job.py
│   ├── [ 14K]  pipeline.py
│   ├── [ 13K]  scheduler.py
│   └── [1.3K]  step.py
└── pipelines
    ├── metagenomics
    │   └── [ 68K]  shotgunmg.py
    ├── [ 16K]  common.py
    └── illumina
        └── [4.4K]  illumina.py

It holds the Python libraries on which ShotgunMG depends, but the most relevant Python files to discuss here in the context of this guide are the following:

pipelines/metagenomics/shotgunmg.py
bio/shotgun_metagenomics.py
bio/microbial_ecology.py

shotgunmg.py contains the main pipeline workflow definition and makes calls to the other Python libraries of the repository to create Job objects and manage the dependencies of each job. The bulk of each job invoked by shotgunmg.py is defined in shotgun_metagenomics.py and microbial_ecology.py, and each job defined in these two libraries fetches its parameters from the ShotgunMG.ini file.

The .ini file

The .ini file, in our case ShotgunMG.ini, contains the job scheduler settings, software modules, database paths and third-party package parameters that will be used at each step of the pipeline. The first few lines of the .ini file contain general definitions, mainly related to the job scheduler:

[DEFAULT]
cluster_submit_cmd=sbatch
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 1
cluster_pmem=--mem=12000
# change my_account_name with real account name if running on SLURM.
cluster_other_arg=--account=my_account_name --export=ALL --mail-type=END --mail-user=$JOB_MAIL
cluster_queue=
cluster_work_dir_arg=-D
cluster_output_dir_arg=-o
cluster_job_name_arg=-J
cluster_cmd_produces_job_id=true
# the clusterSubmitCmdSuffix must be set to | grep \"[0-9]\" if clusterCmdProducesJobId=true
cluster_submit_cmd_suffix= | grep -o "[0-9]*"
cluster_dependency_arg=-d afterok:
cluster_dependency_sep=:
raw_read_dir=./raw_reads
current_dir=./
job_output_dir=./jobs_output

For the SGE job scheduler, the .ini file would look something like this:

cluster_submit_cmd=jobsub
cluster_walltime=-l h_rt=12:00:0
cluster_cpu=-pe dev 1 -l res_cpus=1
cluster_pmem=-l res_mem=12000
cluster_other_arg=-P nrc_eme -V -j y -l res_image=nrc/nrc_all_default_ubuntu-18.04-amd64_latest
cluster_queue=
cluster_work_dir_arg=-wd
cluster_output_dir_arg=-o
cluster_job_name_arg=-N
cluster_cmd_produces_job_id=true
# the clusterSubmitCmdSuffix must be set to | grep \"[0-9]\" if clusterCmdProducesJobId=true
cluster_submit_cmd_suffix= | grep -o "[0-9]*"
#cluster_dependency_arg=-W depend=afterok:
cluster_dependency_arg=-hold_jid 
cluster_dependency_sep=,
extra_java_flags=-XX:ParallelGCThreads=4
raw_read_dir=./raw_reads
current_dir=./
job_output_dir=./jobs_output
# MANDATORY PARAMETERS FOR GPSC/GRIDENGINE
job_scripts_dir=./job_scripts
install_home=/space/project/grdi/eco/bioinfo-tools/nrc
env_modules=/etc/profile.d/modules.sh
tmpdir=/tmp # Not much space in /tmp, consider changing to a path where enough storage is available.

Here, the parameters starting with cluster_ define default values and other values required by SLURM. The raw_read_dir parameter specifies where the raw fastq files are located. The job_output_dir parameter points to the directory where the standard output and standard error of each job are written.

The next chunk of parameters defines which modules will be invoked by the pipeline:

module_perl=nrc/perl/5.26.0
module_python2=nrc/python/2.7.18 # deprecated
module_python3=nrc/python/3.9.0
module_R=nrc/R/4.2.1
module_tools=nrc/nrc_tools/1.3.2
module_rdp_classifier=nrc/rdp_classifier/2.5
module_java=nrc/java/jdk1.8.0_171
module_blast=nrc/blast/2.13.1+
module_trimmomatic=nrc/trimmomatic/0.39
module_bwa=nrc/bwa/0.7.17
module_samtools=nrc/samtools/1.9
module_bedtools=nrc/bedtools/2.23.0
module_exonerate=nrc/exonerate/2.2.0
module_hmmer=nrc/hmmer/3.3.2
module_hdf5=nrc/hdf5/1.8.13
module_pigz=nrc/pigz/2.3.4
module_metabat2=nrc/metabat/2.12.1
module_prodigal=nrc/prodigal/2.6.3
module_checkm=nrc/checkm/1.1.3
module_bowtie2=nrc/bowtie2/2.3.2
module_megahit=nrc/megahit/1.2.9
module_pplacer=nrc/pplacer/1.1
module_diamond=nrc/diamond/2.0.15
module_bbmap=nrc/bbmap/38.11
module_CAT=nrc/CAT/5.2.3
module_maxbin2=nrc/maxbin/2.2.7
module_spades=nrc/SPAdes/3.15.0
module_quast=nrc/quast/5.2.0
module_trnascanse=nrc/trnascan-se/2.0.7
module_barrnap=nrc/barrnap/0.9
module_bedtoolsforbarrnap=nrc/bedtools/2.29.2
module_infernal=nrc/infernal/1.1.3
module_rtk=nrc/rtk/0.93.2
module_fraggenescan=nrc/fraggenescan/1.31

For more information on modules, the reader is invited to consult references on Environment Modules (http://modules.sourceforge.net/).
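In practice the pipeline loads its modules automatically, but the following standard module commands can be useful to verify an installation (module names follow the nrc/ prefix used above; availability depends on your environment):

module avail nrc/              # list available nrc modules
module load nrc/megahit/1.2.9
module list                    # show currently loaded modules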

Next are general (but critical) parameters such as the sequencing read configuration, the file paths of sequencing adapters, databases, etc. The most consequential parameters are the following: assembler, mapper, binner and KO_method. ShotgunMG supports the MEGAHIT and SPAdes assemblers, the bwa-mem and bbmap read mappers, and metabat2 and/or maxbin2 as MAGs/bins generators.

# qc_methods can only be: trimmomatic
qc_methods=trimmomatic
# mapper can be one of these: bwa(i.e. bwa-mem) or bbmap
mapper=bbmap
# assembler can be one of these: megahit or spades.
assembler=megahit
# MAGs generation. Can be one of these: metabat2 or maxbin2 or both with a ":" delimiter.
binner=metabat2
# Annotations. kog is more focused on eukaryotes, so it can be skipped if eukaryotes are not the focus
skip_cog=no
skip_kog=yes
skip_blastp_nr=no
# can be either <hmmsearch_kofam> or <kofamscan>. <diamond_blastp> can also be used if you have access to the private KEGG GENES database.
KO_method=hmmsearch_kofam

Next are database paths:

[DB]
#if METAGENOME
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
#if METATRANSCRIPTOME
#contaminants=$INSTALL_HOME/databases/RDP_training_sets/greengenes/2014-08-11/gg_13_5_withTaxonomy_corrected_silva_r117_euk_mito_chloro_sortedlength_99pcid_Illumina_contaminants.fasta

And next are all the parameters of each job generated by the pipeline. For instance, we have the following values under the [trim] section (followed by [bbduk] and the other sections):

[trim]
threads=6
trailing_min_quality=30
average_quality=30
min_length=45
# Have a look at the fastqc profiles to make sure you are cropping enough bases at the beginning of the reads.
headcrop=16
# crop is optional. If for some reason you are still stuck with overrepresented kmers at the end, use crop=<in> to force them off.
crop=96
adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa
illumina_clip_settings=:2:10:10
cluster_walltime=-t 6:00:0
cluster_cpu=-N 1 -n 6
cluster_pmem=--mem=32000
.
.
.

Important parameters

The most important parameters for the pipeline are listed below. These parameters should always be double-checked before running the pipeline. Key parameters under the [DEFAULT] and [DB] categories are discussed above, but relisted here. Special attention should also be paid to the [trim] parameters, as accurately QCed reads are paramount to achieving a good assembly and, consequently, accurate downstream results. It is strongly encouraged to perform a FastQC analysis on at least one sequencing library before and after QC to make sure that the reads going into the co-assembly are of good quality.

[DB]
#if METAGENOME
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
#if METATRANSCRIPTOME
#contaminants=$INSTALL_HOME/databases/RDP_training_sets/greengenes/2014-08-11/gg_13_5_withTaxonomy_corrected_silva_r117_euk_mito_chloro_sortedlength_99pcid_Illumina_contaminants.fasta

phix=$INSTALL_HOME/databases/contaminants/phix174_NC_001422.1.fasta
## If Bacteria and archea
core=$INSTALL_HOME/databases/fasta/greengenes/core_set_aligned.fasta
rdp_training_set=$INSTALL_HOME/databases/RDP_training_sets/silva/2017-07-12/specie/rRNAClassifier.properties

Jobs and their dependencies

One key feature of ShotgunMG (through the underlying GenPipes engine) is its ability to generate jobs and manage the dependency network of each job. If we look at the first two steps and the first job each of them writes to the commands.sh file, we see the following instructions:

#-------------------------------------------------------------------------------
# STEP: trim
#-------------------------------------------------------------------------------
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP

#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \

java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
  -threads 6 \
  -phred33 \
  /project/6008026/projects/misc/shotgunmg_paper/mock_community/./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz /project/6008026/projects/misc/shotgunmg_paper/mock_community/./raw_reads/Microbial_Mock_Community_Staggered_B_R2.fastq.gz \
  qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single2.fastq.gz \
  ILLUMINACLIP:/project/6008026/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
  TRAILING:30 \
  SLIDINGWINDOW:4:15 \
  MINLEN:45 \
  HEADCROP:16 CROP:96 \
  2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.stats.csv
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=rrg-jtrembla --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0  --mem=32000 -N 1 -n 6  | grep -o "[0-9]*")
echo "$trim_1_JOB_ID    $JOB_NAME   $JOB_DEPENDENCIES   $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST

#-------------------------------------------------------------------------------
# STEP: remove_contam
#-------------------------------------------------------------------------------
STEP=remove_contam
mkdir -p $JOB_OUTPUT_DIR/$STEP

#-------------------------------------------------------------------------------
# JOB: remove_contam_1_JOB_ID: bbduk_paired_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=bbduk_paired_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=$trim_1_JOB_ID
JOB_DONE=job_output/remove_contam/bbduk_paired_Microbial_Mock_Community_Staggered_B.3da67d954e46b705fe615ca301d3b78a.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
remove_contam_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/bbmap/38.11 && \

bbduk.sh \
  in=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz \
  in2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz \
  out=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz \
  out2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz \
  outm=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R1.fastq \
  outm2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R2.fastq \
  stats=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.duk_contam_interleaved_log.txt \
  k=21 \
  minkmerhits=1 \
  ref=/project/6008026/databases/contaminants/Illumina.artifacts.fa \
  overwrite=true \
  threads=1
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=rrg-jtrembla --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 12:00:0  --mem=12000 -N 1 -n 1  -d afterok:$JOB_DEPENDENCIES | grep -o "[0-9]*")
echo "$remove_contam_1_JOB_ID   $JOB_NAME   $JOB_DEPENDENCIES   $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST

The first step of the pipeline, STEP: trim, contains a single job called trimmomatic_Microbial_Mock_Community_Staggered_B, which runs Trimmomatic with the proper parameters. This job has no dependency: the JOB_DEPENDENCIES= variable is not assigned any value and no -d afterok:$JOB_DEPENDENCIES argument is passed to the sbatch command. This means that this job will be executed as soon as the job scheduler finds the appropriate resources for it. In this particular case, we are asking for 32 GB of RAM (--mem=32000), a walltime of 6 hrs (-t 6:00:0) and 1 compute node with 6 cores (-N 1 -n 6). These values are specified in the ShotgunMG.ini file under the [trim] section.

The second step of the pipeline, STEP: remove_contam, contains one job labeled bbduk_paired_Microbial_Mock_Community_Staggered_B, which depends on the successful execution of the trimmomatic_Microbial_Mock_Community_Staggered_B job. In this case, the JOB_DEPENDENCIES variable is assigned the $trim_1_JOB_ID value (JOB_DEPENDENCIES=$trim_1_JOB_ID) and the sbatch command contains the -d afterok:$JOB_DEPENDENCIES argument. This means that SLURM will wait for $trim_1_JOB_ID to complete successfully (i.e. exit_status=0) before releasing the bbduk_paired_Microbial_Mock_Community_Staggered_B job from the waiting queue.
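While jobs are queued or running, the dependency chain can be inspected directly through SLURM; for example (a sketch, with output columns chosen for illustration):

squeue -u $USER -o "%.12i %.35j %.10T %.20E"   # job id, job name, state and pending dependency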

How are these jobs generated in the shotgunmg.py script?

The script pipelines/metagenomics/shotgunmg.py contains the narrative of the pipeline and creates Job objects relying on the bio/shotgun_metagenomics.py and bio/microbial_ecology.py Python libraries. For instance, the trimmomatic_Microbial_Mock_Community_Staggered_B job mentioned above is written on lines 77-109, inside the step trim:


def trim(self):
     
  """
  step trim(): Raw fastqs will be trimmed using Trimmomatic. Interleaved fastqs will be generated after trimming. 
  """
     
  mkdir_p(os.path.join(self._root_dir, "qced_reads"))
   
  jobs = [] 
  trim_stats = [] 
     
  for readset in self.readsets: # The readset file was provided as an argument to the shotgunmg.py command, where one line = one readset.
                                # The for loop will iterate over all readsets and generate a trim job for each of them.
    trim_file_prefix = os.path.join("qced_reads", readset.sample.name, readset.name + ".trim.")
    trim_file_prefix_long = os.path.join("qced_reads", readset.sample.name, readset.name + ".trim.long.")
    if not os.path.exists(os.path.join("qced_reads", readset.sample.name)):
      os.makedirs(os.path.join("qced_reads", readset.sample.name))
     
    if readset.run_type == "PAIRED_END": # If in the readset, the run type is PAIRED_END (majority of time)...

      job = shotgun_metagenomics.trimmomatic(  # will call the trimmomatic function inside the bio/shotgun_metagenomics.py library
        readset.fastq1,
        readset.fastq2,
        trim_file_prefix + "pair1.fastq.gz",
        trim_file_prefix + "single1.fastq.gz",
        trim_file_prefix + "pair2.fastq.gz",
        trim_file_prefix + "single2.fastq.gz",
        readset.quality_offset,
        trim_file_prefix + "out",
        trim_file_prefix + "stats.csv"
      )
      job.name = "trimmomatic_" + readset.sample.name
      job.subname = "trim"
      jobs.append(job) 

The above excerpt is just an example with one readset; in reality, the trim step will generate one job per readset - so many more jobs. This Job object is instantiated by calling the trimmomatic function inside the bio/shotgun_metagenomics.py Python library, which is described below:

def trimmomatic(input1, input2, paired_output1, unpaired_output1, paired_output2, unpaired_output2, quality_offset, trim_log, trim_stats):

  job = Job( 
    [input1, input2], # Input files - essential to specify for the job dependency network
    [paired_output1, unpaired_output1, paired_output2, unpaired_output2, trim_log, trim_stats], #Output files - essential to specify for the job dependency network. This output file will somehow be used as input for the next job definition.
    [
      ['java', 'module_java'], 
      ['trimmomatic', 'module_trimmomatic']
    ] # modules are defined in the .ini file. In the current case, we ask for module_java and module_trimmomatic (defined in the .ini file) to be loaded
  )    

  threads = config.param('trim', 'threads', type='posint')  # config.param fetches values defined in the .ini file; here, within the [trim] category, threads=6.
  adapter_file = config.param('trim', 'adapter_fasta', type='filepath') # here, within the [trim] category, adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa (defined in the .ini file)
  illumina_clip_settings = config.param('trim', 'illumina_clip_settings')
  trailing_min_quality = config.param('trim', 'trailing_min_quality', type='int')
  min_length = config.param('trim', 'min_length', type='posint')
  headcrop = config.param('trim', 'headcrop', required=False, type='int')
  crop = config.param('trim', 'crop', required=False, type='int')
  sliding_window1 = config.param('trim', 'sliding_window1', required=False, type='int')
  sliding_window2 = config.param('trim', 'sliding_window2', required=False, type='int')
   
  if not isinstance( sliding_window1, int ): # Check if sliding_window1 is defined in the .ini file; if not, assign a default value.
    if not(sliding_window1 and sliding_window1.strip()):
      sliding_window1 = 4
  
  if not isinstance( sliding_window2, int ):
    if not(sliding_window2 and sliding_window2.strip()):
      sliding_window2 = 15 

  # Populate the trimmomatic software command itself
  job.command = """
java -XX:ParallelGCThreads={threads} -Xmx2G -jar \$TRIMMOMATIC_JAR {mode} \\
  -threads {threads} \\
  -phred{quality_offset} \\
  {input1} {input2} \\
  {paired_output1} {unpaired_output1} {paired_output2} {unpaired_output2} \\
  ILLUMINACLIP:{adapter_file}{illumina_clip_settings} \\
  TRAILING:{trailing_min_quality} \\
  SLIDINGWINDOW:{sliding_window1}:{sliding_window2} \\
  MINLEN:{min_length} \\
  HEADCROP:{headcrop}""".format(
        mode = "PE",
        threads = threads,
        quality_offset = quality_offset,
        input1 = input1,
        input2 = input2,
        paired_output1 = paired_output1,
        paired_output2 = paired_output2,
        unpaired_output1 = unpaired_output1,
        unpaired_output2 = unpaired_output2,
        adapter_file=adapter_file,
        illumina_clip_settings=illumina_clip_settings,
        trailing_min_quality=trailing_min_quality,
        min_length = min_length,
        sliding_window1 = sliding_window1,
        sliding_window2 = sliding_window2,
        headcrop = str(headcrop)
  )    
  # If crop is defined in the .ini file, we add it to the job.command string/attribute
  if isinstance( crop, int ):
    job.command += """ CROP:{crop}""".format(crop = str(crop))
  job.command += " \\\n  2> " + trim_log
  # Compute statistics
  job.command += " && \\\ngrep ^Input " + trim_log + " | perl -pe 's/^Input Read Pairs: (\\d+).*Both Surviving: (\\d+).*Forward Only Surviving: (\\d+).*$/Raw Fragments,\\1\\nFragment Surviving,\\2\\nSingle Surviving,\\3/' > " + trim_stats

  # finally return the job object.
  return job

The 9 arguments passed to the trimmomatic function are the R1 and R2 fastq.gz files, followed by the 4 output files that Trimmomatic will generate. We then specify the quality offset (usually phred 33), a log file and a file path that will hold trimming statistics. These 9 arguments are used to populate the trimmomatic job's arguments. It is critical to specify the inputs and outputs of each job, as these values are used by the WMS to manage job dependencies when generating jobs for the SLURM or SGE scheduler. In the case of this particular job, once the trimmomatic command successfully executes and returns an exit status of zero, 6 files will have been generated: 4 fastq.gz + 1 log + 1 statistics file. Altogether, these files are the outputs defined in the Job object and are important because they are also used as inputs for the next job, bbduk_paired_<readset_name>, so that the dependency network can be properly built.

Summary

To summarize (Figure 2), the main pipeline workflow is defined in shotgunmg.py, where each Job object is instantiated from definitions in either bio/shotgun_metagenomics.py or bio/microbial_ecology.py. The customizable parameters used in the job definitions of these two Python libraries are set in the .ini file (ShotgunMG.ini).

Figure 2. Summary of the interaction between main pipeline workflow, its libraries and the .ini file.

2- Workflow walkthrough

Since a complete run of the pipeline can generate hundreds of jobs, we will focus on the jobs of steps 1 to 3 (trim, remove_contam, assembly), then step 5 (abundance) and finally step 18 (binning). In the next section, we will go through these selected steps of the ShotgunMG pipeline and describe the results generated at each of them. For this example, the pipeline was run on paired-end fastqs from a bacterial mock community.

Description of results generated by each job

For the sake of simplicity, only the commands themselves will be shown. For more details, the reader is referred to the complete commands.sh example included in the supplementary material of the ShotgunMG website (jtremblay.github.io/shotgunmg.html).

Step 1 - trim

This step runs Trimmomatic with the objective of removing sequencing adapters from each read and performing quality trimming to remove low-quality reads. Particular emphasis should be put on appropriately head- and tail-cropping the reads. Running FastQC on reads before and after trimming is recommended to validate this step, for instance as sketched below.
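FastQC is not part of the modules listed in the .ini file, so the following is only a sketch assuming FastQC is available in your environment; the file names refer to the mock community example:

mkdir -p fastqc/raw fastqc/qced
fastqc -t 4 -o fastqc/raw raw_reads/Microbial_Mock_Community_Even_A_R1.fastq.gz raw_reads/Microbial_Mock_Community_Even_A_R2.fastq.gz
fastqc -t 4 -o fastqc/qced qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.pair2.fastq.gz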

JOB: trimmomatic_<readset_name>

module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \

java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
  -threads 6 \
  -phred33 \
  ./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz ./raw_reads/Microbial_Mock_Community_Staggered_B_R2.fastq.gz \
  qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single2.fastq.gz \
  ILLUMINACLIP:$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
  TRAILING:30 \
  SLIDINGWINDOW:4:15 \
  MINLEN:45 \
  HEADCROP:16 CROP:96 \
  2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.stats.csv
NRC_STATE=\$PIPESTATUS

Purpose: To perform quality control, removing low-quality reads and trimming sequencing adapters.

.ini:

[trim]
threads=6
trailing_min_quality=30
average_quality=30
min_length=45
# Have a look at the fastqc profiles to make sure you are cropping enough bases at the beginning of the reads.
headcrop=16
# crop is optional. If for some reason you are still stuck with overrepresented kmers at the end, use crop=<in> to force them off.
crop=96
adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa
illumina_clip_settings=:2:10:10
cluster_walltime=-t 6:00:0
cluster_cpu=-N 1 -n 6
cluster_pmem=--mem=32000

Step 2- remove_contam

This step invokes bbduk (bbtools) to remove known sequencing contaminants and PhiX reads from the QCed reads obtained from the trim step. The remaining reads (e.g. qced_reads/<readset_name>/<readset_name>.ncontam_paired_R1.fastq.gz and _R2.fastq.gz, as shown in the command below) will be used for the downstream jobs.

JOB: bbduk_paired_<readset_name>

module load nrc/bbmap/38.11 && \

bbduk.sh \
  in=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz \
  in2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz \
  out=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz \
  out2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz \
  outm=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R1.fastq \
  outm2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R2.fastq \
  stats=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.duk_contam_interleaved_log.txt \
  k=21 \
  minkmerhits=1 \
  ref=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa \
  overwrite=true \
  threads=1

Purpose:

To scan fastq libraries for known sequencing contaminants and remove them.

.ini:

[DEFAULT]
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa

[bbduk]
k=21
c=1
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 1
cluster_pmem=--mem=12000

Step 3- assembly

For the sake of simplicity, the two steps above only showed command examples for one readset. Here, for the co-assembly job, we show the command with the full set of readsets (6 readsets). Assembly can be done using either SPAdes or MEGAHIT. MEGAHIT is much faster than SPAdes and has often proven to be the only realistic option for assembling large sequencing datasets.

JOB: assembly

module load nrc/megahit/1.2.9 && \

rm assembly -rf && \
megahit -t 32 --k-min 31 --k-max 131 --k-step 10 \
  --min-contig-len 1000 \
  -1 qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_C/Microbial_Mock_Community_Even_C.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_A/Microbial_Mock_Community_Staggered_A.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_C/Microbial_Mock_Community_Staggered_C.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_B/Microbial_Mock_Community_Even_B.ncontam_paired_R1.fastq.gz \
  -2 qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_C/Microbial_Mock_Community_Even_C.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_A/Microbial_Mock_Community_Staggered_A.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_C/Microbial_Mock_Community_Staggered_C.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_B/Microbial_Mock_Community_Even_B.ncontam_paired_R2.fastq.gz \
  --out-dir assembly && \
sed -i 's/^>\\(\\S\\+\\) .*/>\\1/' assembly/final.contigs.fa && \
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/

Purpose:

To generate assembled contigs from short reads data.

.ini:

[megahit]
# Megahit will use 90% of the node's memory by default.
min_contig_length=1000
num_threads=32
kmin=31
kmax=131
kstep=10
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 32
cluster_pmem=--mem=514500

Step 4- gene_prediction

Use Prodigal to predict ORFs on contigs.

Step 5- abundance

Map qced reads on contigs + generate contigs and gene abundance matrices.

Step 6- exonerate

Split co-assembly and predicted genes fasta files in smaller chunks to perform downstream annotations.

Step 7- kegg_annotation

Attempt to assign a KO to each predicted gene using kofamscan or hmmsearch on the kofam database. If you have access to the private KEGG GENES database, diamond blastp can also be used to assign KOs. The default method here is hmmsearch_kofam.

Step 8- rpsblast_cog

RPSBLAST of predicted genes vs COG db.

Step 9- rpsblast_kog

RPSBLAST of predicted genes vs KOG db.

Step 10- hmmscan_pfam

HMMSCAN of predicted genes vs PFAM-A DB.

Step 11- diamond_blastp_nr

DIAMOND BLASTp of predicted genes vs NCBI nr.

Step 12- ncrna

Scan for non-coding RNAs in contigs: rRNA and tRNA sequences.

Step 13- taxonomic_annotation

Using DIAMOND blastp results generated in the diamond_blastp vs nr step, each contig will be assigned a taxonomic lineage using the CAT package. Taxonomic summaries and figures are then generated.

Step 14- beta_diversity

Compute beta-diversity on contig abundance and gene abundance matrices.

Step 15- alpha_diversity

Compute alpha-diversity (RTK package) on contig abundance and gene abundance matrices.

Step 16- overrepresentation

From the results of the KO and COG annotations (steps 7 and 8), abundance matrices of all KOs and COGs will be generated. For each ortholog type, two matrices are generated: one using the normalized (CPM) gene abundance matrix (./gene_abundance/merged_abundance_cpm.tsv) and the other using the read count (i.e. unnormalized) matrix (./gene_abundance/merged_abundance.tsv). All genes pointing to the same ortholog are summarized, so that the resulting ortholog abundance matrices contain the aggregated abundance values of all genes assigned to that ortholog.

Step 17- finalize

Merge all annotation data generated in steps 7-11 and step 13 into a single gff file and a single tsv file.

Step 18- binning

Generates MAGs/bins with the metabat2 and/or maxbin2 software.

3- Smart restart mechanism

One very useful feature of ShotgunMG (and GenPipes-based pipelines) is its built-in smart-restart mechanism. For instance, let's generate jobs for the first step of the pipeline by running the following command: shotgunmg.py -c ./ShotgunMG.ini -s 1 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh, and let's have a look at the job definition:

#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \

java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
  -threads 6 \
  -phred33 \
  ./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz ./raw_reads/Microbial_Mock_Community_Staggered_B_R2.fastq.gz \
  qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single2.fastq.gz \
  ILLUMINACLIP:$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
  TRAILING:30 \
  SLIDINGWINDOW:4:15 \
  MINLEN:45 \
  HEADCROP:16 CROP:96 \
  2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.stats.csv
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=my-account --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0  --mem=32000 -N 1 -n 6  | grep -o "[0-9]*")
echo "$trim_1_JOB_ID    $JOB_NAME   $JOB_DEPENDENCIES   $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST

The value stored in the JOB_DONE variable (JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done) is actually an md5 signature of the names, sizes and creation dates of the job's files, combined with all the parameters used to generate the job. Once the job has executed properly on the compute cluster and an exit_status=0 is returned, the file defined by the JOB_DONE variable is touched (i.e. created): if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi.
Once this job has successfully completed, if we decide to run the first step of the pipeline again (shotgunmg.py -c ./ShotgunMG.ini -s 1 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh), we'll find that no job is written to the commands.sh file. This is because, during the job generation process, the pipeline detected that this job was up to date, because:
1) the infiles (fastq.gz files) of the job were older than the outfiles (.pair1.fastq.gz, .pair2.fastq.gz, .single1.fastq.gz, .single2.fastq.gz) it generated;
2) no changes whatsoever were made to the input files and, in consequence,
3) the file sizes of the input files remained the same;
4) there were no modifications to the parameters in the .ini file under the [trim] section.
Overall, the md5 signature is identical between the executed job and the new Job object and, in consequence, no new job is created. If we had modified one of the fastq headers in one of the fastq files and rerun the pipeline, this job would have been regenerated, because the pipeline would have found differences between the old and new md5-signed .done file. This mechanism is useful if, for some reason, we realize that one of the parameters given to trimmomatic was not optimal. For instance, if we increased the crop value from 96 to 100, upon re-execution of shotgunmg.py all the trimmomatic jobs would be regenerated, because the md5 signature of the old .done file would no longer match the md5 signature of the new job - in other words, the job would no longer be up to date and would consequently be regenerated. This mechanism is also useful in the context of random job failures caused by a faulty compute node or (as described above) when the pipeline operator wants to modify one or many variables of a given step of the pipeline.
Another instance where this mechanism is useful is when a job crashes because we didn't ask for enough memory or cores for its execution: SLURM detects this and kills the job because the limits of the requested resources were exceeded. When that happens, an exit_status other than 0 is returned for the offending job and the downstream dependent jobs are then killed by SLURM. Instead of manually going through every job to find which one returned a non-zero exit_status, the pipeline operator can simply rerun shotgunmg.py with the same parameters and see which jobs are generated.
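In practice, recovering from such a failure amounts to re-running the exact same command that generated the jobs; only jobs that are not up to date are re-written (a usage sketch):

# after fixing the .ini (e.g. increasing cluster_pmem or cluster_cpu for the offending step), re-generate the jobs:
shotgunmg.py -c ./ShotgunMG.ini -s 1-5 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh
# commands.sh now only contains the jobs that are not up to date (the failed job and its downstream dependencies)
bash ./commands.sh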

4- End results

Arguably, the most important results generated by the pipeline are the following files:

Read counts

Contains read count summary throughout the different steps of the pipeline.

contig_abundance/qc_mappping_stats.tsv

Co-assembly

Co-assembly fasta file:

assembly/Contigs.fasta

Statistics and profiles of co-assembly.

assembly/assembly_stats.txt
assembly/quast/report.html

Gene prediction

Predicted genes in fna, faa and gff formats:

gene_prediction/Contigs_renamed.fna
gene_prediction/Contigs_renamed.faa
gene_prediction/Contigs_renamed.gff

Contig and gene abundance matrices

Text files in .tsv format. Unnormalized (absolute number of QCed reads that properly aligned to genes or contigs):

gene_abundance/merged_gene_abundance.tsv
contig_abundance/merged_contig_abundance.tsv

Normalized abundance - computed with the edgeR package using the TMM method - given in Counts Per Million (CPM):

gene_abundance/merged_gene_abundance_cpm.tsv
contig_abundance/merged_contig_abundance_cpm.tsv

Annotations

All merged annotations (except ncrna) combined in a .tsv file

annotations/annotations.tsv

Alpha diversity

Contains the alpha diversity values: observed contigs/genes (richness), Shannon index, etc.

alpha_div/contig_abundance/alpha_richness.tsv
alpha_div/contig_abundance/alpha_shannon.tsv
alpha_div/gene_abundance/alpha_richness.tsv
alpha_div/gene_abundance/alpha_shannon.tsv
alpha_div/rpob/alpha_richness.tsv
alpha_div/rpob/alpha_shannon.tsv

Beta diversity

Contains the beta diversity (Bray-Curtis dissimilarity) distance matrices, PCoAs and interactive figures computed from the contig and gene abundance matrices. The interactive 3D figures can be opened in most web browsers. Beta diversity metrics are generated using microbiomeutils (https://github.com/jtremblay/microbiomeutils).

betadiv/bray_curtis_contig_bacteriaArchaea/3d_bray_curtis_plot/index.html
betadiv/bray_curtis_contig_abundance/3d_bray_curtis_plot/index.html
betadiv/bray_curtis_gene_abundance/3d_bray_curtis_plot/index.html

Distance matrices.

betadiv/bray_curtis_contig_abundance/bray_curtis_contig_abundance.tsv
betadiv/bray_curtis_contig_bacteriaArchaea/bray_curtis_contig_bacteriaArchaea.tsv
betadiv/bray_curtis_gene_abundance/bray_curtis_gene_abundance.tsv

Principal coordinates analysis tables computed from the distance matrices described above.

betadiv/bray_curtis_contig_abundance/bray_curtis_contig_abundance_coords.tsv
betadiv/bray_curtis_contig_bacteriaArchaea/bray_curtis_contig_bacteriaArchaea_coords.tsv
betadiv/bray_curtis_gene_abundance/bray_curtis_gene_abundance_coords.tsv

DIAMOND BLASTp output of genes vs KEGG genes DB or kofamscan results

# Diamond blastp vs KEGG GENES db (if access to KEGG license)
annotations/blastp_kegg.tsv
annotations/KOs_parsed.tsv

# kofamscan (free - no license required)
annotations/kofamscan.tsv
annotations/KOs_parsed.tsv

# hmmsearch kofam (free - no license required)
annotations/hmmsearch_kofam.tsv
annotations/KOs_parsed.tsv

DIAMOND BLASTp output of genes vs NCBI nr DB

annotations/blastp_nr.tsv

hmmscan output of genes vs PFAM-A DB.

Three different types of output are provided. Please consult the HMMER 3 documentation for more details.

annotations/hmmscan_pfam_domtblout.tsv
annotations/hmmscan_pfam_pfamtblout.tsv
annotations/hmmscan_pfam_tblout.tsv

tRNA (tRNAscan) and rRNA (barrnap) abundance/annotations

# tRNA
annotations/trna/trna*tsv

# rRNA
annotations/rrna/rdp/*
annotations/rrna/barrnap/*
annotations/rrna/abundance/*
annotations/rrna/abundance/arc_SSU_merged_abundance.tsv
annotations/rrna/abundance/bac_LSU_merged_abundance.tsv
annotations/rrna/abundance/bac_SSU_merged_abundance.tsv
annotations/rrna/abundance/euk_LSU_merged_abundance.tsv
annotations/rrna/abundance/euk_SSU_merged_abundance.tsv

Taxonomic profiles

Taxonomic profile tables at L1=kingdom, L2=phylum, L3=class, L4=order, L5=family and L6=genus, in both absolute (consensus rarefaction) and relative abundance.

# All organisms
annotations/taxonomy_consensus/all/absolute/
annotations/taxonomy_consensus/all/relative/
# Bacteria/Archaea only
annotations/taxonomy_consensus/bacteriaArchaea/absolute/
annotations/taxonomy_consensus/bacteriaArchaea/relative/
# Non-bacterial/archaeal:
annotations/taxonomy_consensus/others/absolute/
annotations/taxonomy_consensus/others/relative/

5- Job output folder

When a job is launched/submitted, whether through the job scheduler or directly on the terminal with the batch option, its standard output and error are redirected and written to the ./job_output/ folder, under the subfolder of the step the job belongs to. For instance, consider the following job, where JOB_OUTPUT_DIR is set to ./job_output. The trim step (step 1) will generate a subdirectory:

STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP

For instance, if we look at the trimmomatic_Microbial_Mock_Community_Even_A job, we have the following instructions, which define the file that will hold the standard output and error:

JOB_NAME=trimmomatic_Microbial_Mock_Community_Even_A
JOB_DEPENDENCIES=
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o

Finally, when the job is submitted to the job scheduler with the following command: sbatch --account=my_account_xyz --export=ALL --mail-type=END --mail-user=$JOB_MAIL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0 --mem=32000 -N 1 -n 6, the -o $JOB_OUTPUT argument instructs the job scheduler where to write the standard output and error.

Here is the complete chunk of instructions:

OUTPUT_DIR=/fs/hnas1-evs1/Dgrdi/grdi_eco/training/shotgun_metagenomics_20201125/mock_community_shotgunMG/megahit_bwa
JOB_OUTPUT_DIR=$OUTPUT_DIR/job_output
TIMESTAMP=`date +%FT%H.%M.%S`
JOB_LIST=$JOB_OUTPUT_DIR/Metagenomics_job_list_$TIMESTAMP
mkdir -p $OUTPUT_DIR
cd $OUTPUT_DIR

#-------------------------------------------------------------------------------
# STEP: trim
#-------------------------------------------------------------------------------
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP

#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Even_A
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Even_A
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Even_A.e8414ede7cfd9241c50df6291e44ab89.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_171 nrc/trimmomatic/0.39 && \

java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
  -threads 6 \
  -phred33 \
  /fs/hnas1-evs1/Dgrdi/grdi_eco/training/shotgun_metagenomics_20201125/mock_community_shotgunMG/megahit_bwa/./raw_reads//Microbial_Mock_Community_Even_A_R1.fastq.gz /fs/hnas1-evs1/Dgrdi/grdi_eco/training/shotgun_metagenomics_20201125/mock_community_shotgunMG/megahit_bwa/./raw_reads//Microbial_Mock_Community_Even_A_R2.fastq.gz \
  qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.single2.fastq.gz \
  ILLUMINACLIP:/space/project/grdi/eco/bioinfo-tools/nrc/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
  TRAILING:30 \
  SLIDINGWINDOW:4:15 \
  MINLEN:45 \
  HEADCROP:16 CROP:96 \
  2> qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.stats.csv
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
jobsub -P nrc_eme -V -j y -l res_image=nrc/nrc_all_default_ubuntu-18.04-amd64_latest -wd $OUTPUT_DIR -o $JOB_OUTPUT -N $JOB_NAME -l h_rt=6:00:0  -l res_mem=32000 -pe dev 6 -l res_cpus=6 | grep -o "[0-9]*")
echo "$trim_1_JOB_ID    $JOB_NAME   $JOB_DEPENDENCIES   $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST

6- Integrating more options in a given bioinformatics package

One useful feature of ShotgunMG (and the GenPipes WMS) is the possibility to customize each step of a workflow. For instance, if we wanted to add an option for Megahit, we would need to add a few lines of code to the megahit_R1_R2 function found in the bio/shotgun_metagenomics.py file. As a reminder, the megahit_R1_R2 function is called from the main pipeline script/wrapper shotgunmg.py:

job = shotgun_metagenomics.megahit_R1_R2(
  fastq_gz_list_R1,
  fastq_gz_list_R2,
  os.path.join("assembly"),
  reads_type
)
job.name = "megahit"
job.subname = "megahit"
jobs.append(job)

where a job object will be instantiated in the megahit_R1_R2 function of the bio/shotgun_metagenomics.py library:

def megahit_R1_R2(infiles_R1, infiles_R2, outdir, type):
     
  if(type == "pe"):
    infiles_string_R1 = "-1 " + ",".join(infiles_R1);
    infiles_string_R2 = "-2 " + ",".join(infiles_R2);

  job = Job( 
    infiles_R1 + infiles_R2,
    [os.path.join(outdir, "Contigs.fasta"), os.path.join(outdir, "final.contigs.fa")],
    [
      ['megahit', 'module_megahit']
    ]
  )    
  job.command="""
rm {outdir} -rf && \\
megahit -t {num_threads} --k-min {kmin} --k-max {kmax} --k-step {kstep} \\
  --min-contig-len {min_contig_length} \\
  {infiles_R1} \\
  {infiles_R2} \\
  --out-dir {outdir} && \\
sed -i 's/^>\\\(\\\S\\\+\\\) .*/>\\\\1/' assembly/final.contigs.fa && \\
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/""".format(
    min_contig_length = config.param("megahit", "min_contig_length", 1, "posint"),
    num_threads = config.param("megahit", "num_threads", 1, "posint"),
    kmin = config.param("megahit", "kmin", 1, "posint"),
    kmax = config.param("megahit", "kmax", 1, "posint"),
    kstep = config.param("megahit", "kstep", 1, "int"),
    infiles_R1 = infiles_string_R1,
    infiles_R2 = infiles_string_R2,
    outdir = outdir
  )

  return job

In the current implementation, we only included 5 parameters specific to Megahit (min_contig_length, num_threads, kmin, kmax and kstep). If, for instance, we realize at some point that it would be useful to implement support for Megahit's --prune-level <int> option, we would need to add a line specifying this option in the megahit_R1_R2 function and use it in the command template:

def megahit_R1_R2(infiles_R1, infiles_R2, outdir, type):
     
  if(type == "pe"):
    infiles_string_R1 = "-1 " + ",".join(infiles_R1);
    infiles_string_R2 = "-2 " + ",".join(infiles_R2);

  job = Job( 
    infiles_R1 + infiles_R2,
    [os.path.join(outdir, "Contigs.fasta"), os.path.join(outdir, "final.contigs.fa")],
    [
      ['megahit', 'module_megahit']
    ]
  )    
  job.command="""
rm {outdir} -rf && \\
megahit -t {num_threads} --k-min {kmin} --k-max {kmax} --k-step {kstep} --prune-level {prune_level} \\
  --min-contig-len {min_contig_length} \\
  {infiles_R1} \\
  {infiles_R2} \\
  --out-dir {outdir} && \\
sed -i 's/^>\\\(\\\S\\\+\\\) .*/>\\\\1/' assembly/final.contigs.fa && \\
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/""".format(
    min_contig_length = config.param("megahit", "min_contig_length", 1, "posint"),
    num_threads = config.param("megahit", "num_threads", 1, "posint"),
    kmin = config.param("megahit", "kmin", 1, "posint"),
    kmax = config.param("megahit", "kmax", 1, "posint"),
    kstep = config.param("megahit", "kstep", 1, "int"),
    prune_level = config.param("megahit", "prune_level", 1, "int"), #!! <--- Here we added a new option
    infiles_R1 = infiles_string_R1,
    infiles_R2 = infiles_string_R2,
    outdir = outdir
  )

  return job

We would also need to add the prune_level option in the .ini file under the [megahit] category (note that the {prune_level} placeholder was also added to the megahit command template above so that the option actually takes effect).

[megahit]
# Megahit will use 90% of the node's memory by default.
min_contig_length=1000
num_threads=32
kmin=31
kmax=131
kstep=10
prune_level=3
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 32
cluster_pmem=--mem=514500

We should now be ready to go with this new option. Adding an option for Megahit is a simple case, as it only consists of passing an additional argument to the megahit command.