ShotgunMG v1.3.2 user guide
Content
This document is divided into the following sections:
1) Global overview
2) Workflow walkthrough
3) Smart restart mechanism
4) End results
5) Information regarding the job output folder where standard output and standard error of each job are written
6) Integrating more options in a given bioinformatics package
All files referred to in this document can be found in the companion Docker image: https://cloud.docker.com/u/julio514/repository/docker/julio514/centos. The first step should be to install this image on your computer.
1- Getting started
Global overview
ShotgunMG is a bioinformatics pipeline geared to process short-read metagenomic sequencing data on High Performance Computing (HPC) environments. It is intended for bioinformaticians experienced with a Linux-based OS (i.e. the command line) and an HPC environment, including interacting with a job scheduler, in our case SLURM or SGE. The pipeline relies on the GenPipes Workflow Management System (WMS) (https://doi.org/10.1093/gigascience/giz037). The pipeline's methodology is described in detail (among others) in the following preprint (https://www.biorxiv.org/content/10.1101/2022.04.19.488797v3).
The pipeline requires the following 3 elements in order to be functional:
1) The pipeline itself (nrc_pipeline_public): https://bitbucket.org/jtremblay514/nrc_pipeline_public/src/master/ (or v1.3.2)
2) A set of heterogeneous Perl and R scripts (nrc_tools_public): https://bitbucket.org/jtremblay514/nrc_tools_public/src/master/ (or v1.3.2)
3) An extensive set of third-party software, listed below with the versions used and tested with the pipeline. Installation details for each of these packages are also available (nrc_resources_public): https://bitbucket.org/jtremblay514/nrc_resources_public/src/master/ (or v1.3.2)
List of third party software:
trimmomatic/0.39; megahit/1.2.9; metabat/2.12.1;
perl/5.26.0; bbmap/38.11; python/3.9.0;
R/4.2.1; java/jdk1.8.0_144; blast/2.13.0+;
bedtools/2.23 and 2.29; barrnap/0.9; hmmer/3.3.2;
maxbin/2.7; prodigal/2.6.3; CAT/5.2.3;
RTK/0.93.2; trnascan-se/2.0.7; exonerate/2.2.0;
checkm/1.1.3; pplacer/1.1; quast/5.0.2;
fraggenescan/1.31; spades/3.15.0; infernal/1.1.3;
kofamscan/1.3.0; parallel/20220322; pigz/2.3.4;
diamond/2.0.15
Installation of all the required libraries and external packages is a time-consuming task; for the purpose of this guide, we refer the reader to the CentOS 7 image we provide, which includes all the packages needed to properly execute ShotgunMG.
The pipeline has been written in such a way that the user can use various software packages in the workflow. It supports two shotgun metagenome assemblers (MEGAHIT and meta-SPAdes), two read mappers (bwa-mem and bbmap) and two metagenome binning methods (MetaBAT2 and MaxBin2).
Setting up the databases
The databases needed by ShotgunMG were not included in the Docker image because of size constraints. The following databases should be installed manually:
CAT
Go here - https://github.com/dutilh/CAT - and follow the instructions under the preconstructed databases section.
CheckM
Follow the instructions at https://github.com/Ecogenomics/CheckM.
NCBI nr
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz (Do not forget to run diamond makedb --in nr -p 4 --db nr.dmnd once downloaded).
PFAM-A
Available here: http://ftp.ebi.ac.uk/pub/databases/Pfam. Use the latest version. You may have to run hmmpress once downloaded.
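As a minimal sketch of this preparation (assuming the current_release layout on the Pfam FTP site, that wget and HMMER are available, and a hypothetical $INSTALL_HOME/databases/pfam target directory):
```bash
# Download and index Pfam-A (paths are assumptions; adjust to your own layout)
mkdir -p $INSTALL_HOME/databases/pfam && cd $INSTALL_HOME/databases/pfam
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
# Build the binary auxiliary files used by hmmscan/hmmsearch
hmmpress Pfam-A.hmm
```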
Contaminants
Available here: http://jtremblay.github.io/files/contaminants.tar.gz. Contains known Illumina contaminants, other sequencing artefacts and adapter sequences from various kits (NexteraXT, TruSeq, etc.). In fasta format.
KEGG
KEGG orthologs (KO) assignment is done using kofamscan, available here: https://www.genome.jp/ftp/tools/kofam_scan/kofam_scan-1.3.0.tar.gz. In order to run kofamscan, you will need the HMM profiles of each KO - available here: https://www.genome.jp/ftp/db/kofam/profiles.tar.gz - and the KO link file - available here: https://www.genome.jp/ftp/db/kofam/ko_list.gz. Another file, needed to link each KO to its associated pathway and/or module, is available here: https://jtremblay.github.io/files/kegg_ref_pathways_modules_combined.tsv.gz.
Once downloaded, uncompress these files and move them into their location, which should be $INSTALL_HOME/databases/kofam/. Then concatenate the HMM profiles into a single file (cat ./profiles/K* > kofam.hmm) and run hmmpress on it (hmmpress kofam.hmm). Once done, this kofam.hmm file can be used with the hmmsearch software - see the KEGG section downstream. Our internal benchmarks showed that hmmsearch and kofamscan gave identical results (i.e. for both methods, each gene pointed to the same KO).
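Put together, the kofam preparation described above might look like the following sketch (the $INSTALL_HOME/databases/kofam location is taken from the text above; wget and HMMER are assumed to be available):
```bash
# Prepare the kofam database for hmmsearch
mkdir -p $INSTALL_HOME/databases/kofam && cd $INSTALL_HOME/databases/kofam
wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz
wget https://www.genome.jp/ftp/db/kofam/ko_list.gz
tar -xzf profiles.tar.gz && gunzip ko_list.gz
# Concatenate all KO profiles into a single HMM file and index it
cat ./profiles/K* > kofam.hmm
hmmpress kofam.hmm
```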
Setting up files needed by the pipeline
The first step in running ShotgunMG is to set up the files required by the pipeline. Fastq libraries usually come in the form of demultiplexed paired-end sequencing libraries - one library per sample. These .fastq.gz files should be stored in a directory labeled raw_reads/. The ShotgunMG.ini and readset.tsv files are needed as well (more on this below).
Once the Docker image is properly installed and loaded, we can go to the following directory:
cd $INSTALL_HOME
root@centos7image:/project/microbiome_genomics$ pwd
/project/microbiome_genomics
Next, we navigate to the projects directory, which contains the following:
>cd projects
>ls -l
mock_community_shotgunMG
The folder mock_community_shotgunMG contains all the files necessary for the execution of the pipeline. The raw data is from mock community samples (HM-782D and HM-783D, BEI Resources, Manassas, VA, USA) prepared for shotgun metagenomic libraries (NexteraXT) and sequenced on a HiSeq2500 system in a 2x100 configuration. The ShotgunMG.ini file is configured to run a trimmomatic/megahit/metabat2 workflow.
root@centos7image:/project/microbiome_genomics/projects$ cd mock_community_shotgunMG/
root@centos7image:/project/microbiome_genomics/projects/mock_community_shotgunMG$ ls -l
total 52
-rwxrwxr-x 1 root root 12175 Feb 6 20:17 ShotgunMG.ini
drwxrwxr-x 1 root root 4096 Feb 7 20:06 job_output
-rw-r--r-- 1 root root 3632 Feb 4 20:08 mapping_file.tsv
drwxrwsr-x 1 root root 4096 Feb 1 15:45 raw_reads
-rwxr-x--- 1 root root 1261 Feb 6 20:41 workflow.sh
The two inputs we need in order to launch the pipeline are:
1) the raw fastq.gz reads, which should all be stored in the raw_reads/ directory. For paired-end libraries, all files should be labelled *_R1.fastq.gz and *_R2.fastq.gz. For single-end libraries, each file should end with _R1.fastq.gz. No specific nomenclature is expected as long as the file names follow these minimal specifications.
>ls -l raw_reads/
-rw-r--r-- 1 root root 9211851 May 27 20:09 Microbial_Mock_Community_Even_A_R1.fastq.gz
-rw-r--r-- 1 root root 9674881 May 27 20:09 Microbial_Mock_Community_Even_A_R2.fastq.gz
-rw-r--r-- 1 root root 7828069 May 27 20:08 Microbial_Mock_Community_Even_B_R1.fastq.gz
-rw-r--r-- 1 root root 8072334 May 27 20:08 Microbial_Mock_Community_Even_B_R2.fastq.gz
-rw-r--r-- 1 root root 9416489 May 27 20:09 Microbial_Mock_Community_Even_C_R1.fastq.gz
-rw-r--r-- 1 root root 9915486 May 27 20:09 Microbial_Mock_Community_Even_C_R2.fastq.gz
-rw-r--r-- 1 root root 5080506 May 27 20:08 Microbial_Mock_Community_Staggered_A_R1.fastq.gz
-rw-r--r-- 1 root root 5586795 May 27 20:08 Microbial_Mock_Community_Staggered_A_R2.fastq.gz
-rw-r--r-- 1 root root 5080506 May 27 20:08 Microbial_Mock_Community_Staggered_B_R1.fastq.gz
-rw-r--r-- 1 root root 5586795 May 27 20:08 Microbial_Mock_Community_Staggered_B_R2.fastq.gz
-rw-r--r-- 1 root root 5080506 May 27 20:08 Microbial_Mock_Community_Staggered_C_R1.fastq.gz
-rw-r--r-- 1 root root 5586795 May 27 20:08 Microbial_Mock_Community_Staggered_C_R2.fastq.gz
2) the ShotgunMG.ini and readset.tsv files (the latter is generated from the raw reads, as shown below).
We left traces of the relevant commands in the workflow.sh file, which contains the following two commands:
>cat workflow.sh
module load nrc/nrc_tools/1.3
# Preprocessing raw reads
generateReadsetSheet.pl --indir ./raw_reads > readset.tsv
# Run the pipeline. The pipeline should be run in 4 parts: 1) steps 1 to 5. 2) step 6 and 3) steps 7-15. 4) MAGs can then be generated with step 16.
# Steps 1-5
shotgunmg.py -c ./ShotgunMG.ini -s 1-5 -o . -j slurm -b barcodes.fasta -z ./pipeline.json > ./commands.sh
The first command calls the generateReadsetSheet.pl script, which takes all .fastq.gz files in the raw_reads/ directory and creates a ./readset.tsv file that holds the information required to launch the pipeline.
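To sanity-check the generated sheet before going further, it can simply be displayed as an aligned table (a minimal sketch; column and head are standard Linux tools):
```bash
# Display the first few lines of the generated readset sheet
column -t -s $'\t' readset.tsv | head
```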
Running the pipeline
We can run the pipeline wrapper:
>shotgunmg.py -h
Steps:
1- trim
2- remove_contam
3- assembly
4- gene_prediction
5- abundance
6- exonerate
7- kegg_annotation
8- rpsblast_cog
9- rpsblast_kog
10- hmmscan_pfam
11- diamond_blastp_nr
12- ncrna
13- taxonomic_annotation
14- beta_diversity
15- alpha_diversity
16- overrepresentation
17- finalize
18- binning
Steps 19 and onward are not officially supported.
shotgunmg.py -c ./ShotgunMG.ini -s 1-6 -o . -j slurm -z ./pipeline.json > ./commands.sh
Here, the -s argument specifies which steps to run. Specifying -s 1-6 will run ShotgunMG's steps 1 to 6:
1- trim
2- remove_contam
3- assembly
4- gene_prediction
5- abundance
6- exonerate
A selection of specific steps can also be run: for instance, -s 1-6 instructs the pipeline to generate jobs for steps 1 to 6 only. Once shotgunmg.py has been executed, all the jobs generated by the pipeline will be written to standard output or to the commands.sh file specified above. If the -j slurm argument is specified, all jobs are written in such a way that they can be submitted to the SLURM job scheduler, along with their job dependency requirement(s) and syntax, if any. If the -j batch argument is specified, the jobs are written without the SLURM syntax so they can be run interactively in the terminal.
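For example, to generate the jobs of step 1 without the SLURM syntax and execute them directly in the terminal, one could do something like the following (a sketch; file names follow the examples above):
```bash
# Generate jobs for step 1 without scheduler syntax, then run them interactively
shotgunmg.py -c ./ShotgunMG.ini -s 1 -o . -j batch -z ./pipeline.json > ./commands.sh
bash ./commands.sh
```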
The -z ./pipeline.json argument specifies the json output file in which to store the pipeline structure; it can then be used for visualization purposes, as shown here - http://jtremblay.github.io/PipelineViewer/ShotgunMG.html
The -c ./ShotgunMG.ini argument refers to the file that contains all the various jobs' parameters for the pipeline and is described in detail below.
shotgunmg.py and its corresponding libraries: bio/shotgun_metagenomics.py and bio/microbial_ecology.py
The nrc_pipeline_public repository contains the following structure:
├── bio
│   ├── [156K] shotgun_metagenomics.py
│   ├── [122K] microbial_ecology.py
│   ├── [ 96K] rrna_amplicons.py # For the AmpliconTagger pipeline; not actually used in shotgunmg.py
│   ├── [1.0K] sample.py
│   └── [ 671] readset.py
├── core
│   ├── [5.2K] config.py
│   ├── [7.6K] job.py
│   ├── [ 14K] pipeline.py
│   ├── [ 13K] scheduler.py
│   └── [1.3K] step.py
└── pipelines
    ├── metagenomics
    │   └── [ 68K] shotgunmg.py
    ├── [ 16K] common.py
    └── illumina
        └── [4.4K] illumina.py
It holds the Python libraries on which ShotgunMG depends, but the most relevant Python files to discuss here in the context of this guide are the following:
pipelines/metagenomics/shotgunmg.py
bio/shotgun_metagenomics.py
bio/microbial_ecology.py
shotgunmg.py contains the main pipeline workflow definition and makes calls to other Python libraries of the repository to create Job objects and manage the dependencies of each job. The bulk of each job invoked by shotgunmg.py is, however, defined in shotgun_metagenomics.py and microbial_ecology.py, and each job defined in these two libraries fetches its parameters from the ShotgunMG.ini file.
The .ini file
The .ini file, in our case ShotgunMG.ini, contains the job scheduler, software, database path and third-party package parameters that will be used at each step of the pipeline. The first few lines of the .ini file contain general definitions, mainly related to the job scheduler:
[DEFAULT]
cluster_submit_cmd=sbatch
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 1
cluster_pmem=--mem=12000
# change my_account_name with real account name if running on SLURM.
cluster_other_arg=--account=my_account_name --export=ALL --mail-type=END --mail-user=$JOB_MAIL
cluster_queue=
cluster_work_dir_arg=-D
cluster_output_dir_arg=-o
cluster_job_name_arg=-J
cluster_cmd_produces_job_id=true
# the clusterSubmitCmdSuffix must be set to | grep \"[0-9]\" if clusterCmdProducesJobId=true
cluster_submit_cmd_suffix= | grep -o "[0-9]*"
cluster_dependency_arg=-d afterok:
cluster_dependency_sep=:
raw_read_dir=./raw_reads
current_dir=./
job_output_dir=./jobs_output
For the SGE job scheduler, the .ini file would look something like this:
cluster_submit_cmd=jobsub
cluster_walltime=-l h_rt=12:00:0
cluster_cpu=-pe dev 1 -l res_cpus=1
cluster_pmem=-l res_mem=12000
cluster_other_arg=-P nrc_eme -V -j y -l res_image=nrc/nrc_all_default_ubuntu-18.04-amd64_latest
cluster_queue=
cluster_work_dir_arg=-wd
cluster_output_dir_arg=-o
cluster_job_name_arg=-N
cluster_cmd_produces_job_id=true
# the clusterSubmitCmdSuffix must be set to | grep \"[0-9]\" if clusterCmdProducesJobId=true
cluster_submit_cmd_suffix= | grep -o "[0-9]*"
#cluster_dependency_arg=-W depend=afterok:
cluster_dependency_arg=-hold_jid
cluster_dependency_sep=,
extra_java_flags=-XX:ParallelGCThreads=4
raw_read_dir=./raw_reads
current_dir=./
job_output_dir=./jobs_output
# MANDATORY PARAMETERS FOR GPSC/GRIDENGINE
job_scripts_dir=./job_scripts
install_home=/space/project/grdi/eco/bioinfo-tools/nrc
env_modules=/etc/profile.d/modules.sh
tmpdir=/tmp # Not much space in /tmp, consider changing to a path where enough storage is available.
Here, the parameters starting with cluster_ define default values and other values required by SLURM. The raw_read_dir parameter specifies where the raw fastq files are located. The job_output_dir points to the location where the standard output and standard error of each job are written.
The next chunk of parameters defines which modules will be invoked by the pipeline:
module_perl=nrc/perl/5.26.0
module_python2=nrc/python/2.7.18 # deprecated
module_python3=nrc/python/3.9.0
module_R=nrc/R/4.2.1
module_tools=nrc/nrc_tools/1.3.2
module_rdp_classifier=nrc/rdp_classifier/2.5
module_java=nrc/java/jdk1.8.0_171
module_blast=nrc/blast/2.13.1+
module_trimmomatic=nrc/trimmomatic/0.39
module_bwa=nrc/bwa/0.7.17
module_samtools=nrc/samtools/1.9
module_bedtools=nrc/bedtools/2.23.0
module_exonerate=nrc/exonerate/2.2.0
module_hmmer=nrc/hmmer/3.3.2
module_hdf5=nrc/hdf5/1.8.13
module_pigz=nrc/pigz/2.3.4
module_metabat2=nrc/metabat/2.12.1
module_prodigal=nrc/prodigal/2.6.3
module_checkm=nrc/checkm/1.1.3
module_bowtie2=nrc/bowtie2/2.3.2
module_megahit=nrc/megahit/1.2.9
module_pplacer=nrc/pplacer/1.1
module_diamond=nrc/diamond/2.0.15
module_bbmap=nrc/bbmap/38.11
module_CAT=nrc/CAT/5.2.3
module_maxbin2=nrc/maxbin/2.2.7
module_spades=nrc/SPAdes/3.15.0
module_quast=nrc/quast/5.2.0
module_trnascanse=nrc/trnascan-se/2.0.7
module_barrnap=nrc/barrnap/0.9
module_bedtoolsforbarrnap=nrc/bedtools/2.29.2
module_infernal=nrc/infernal/1.1.3
module_rtk=nrc/rtk/0.93.2
module_fraggenescan=nrc/fraggenescan/1.31
For more information on modules, the reader is invited to consult references on Environment Modules (http://modules.sourceforge.net/).
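As a quick check that the module definitions resolve on your system, one of them can be loaded manually (a sketch; module names are those listed above and assume the NRC module tree is installed as in the Docker image):
```bash
# List available nrc modules and load one of them to confirm it resolves
module avail nrc 2>&1 | head
module load nrc/trimmomatic/0.39
# The trim jobs expect the $TRIMMOMATIC_JAR variable to be set by this module
echo $TRIMMOMATIC_JAR
```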
Next are general (but critical) parameters such as the sequencing read configuration and the file paths of sequencing adapters, databases, etc. The most consequential parameters are the following: assembler, mapper, binner and KO_method. ShotgunMG supports the MEGAHIT and SPAdes assemblers, the bwa-mem and bbmap read mappers, and MetaBAT2 and/or MaxBin2 as MAG/bin generators.
# qc_methods can only be: trimmomatic
qc_methods=trimmomatic
# mapper can be one of these: bwa(i.e. bwa-mem) or bbmap
mapper=bbmap
# assembler can be one of these: megahit or spades.
assembler=megahit
# MAGs generation. Can be one of these: metabat2 or maxbin2 or both with a ":" delimiter.
binner=metabat2
# Annotations. kog is more focused on eukaryotes, so it can be skipped if eukaryotes are not the focus
skip_cog=no
skip_kog=yes
skip_blastp_nr=no
# can be either <hmmsearch_kofam> or <kofamscan>. <diamond_blastp> can also be used if you have access to the private KEGG GENES database.
KO_method=hmmsearch_kofam
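For instance, to switch the example configuration from a megahit/bbmap/metabat2 workflow to spades/bwa with both binners, these keys could be edited in place (a sketch using sed; editing ShotgunMG.ini by hand works just as well):
```bash
# Switch assembler, mapper and binner in ShotgunMG.ini (back up the file first)
cp ShotgunMG.ini ShotgunMG.ini.bak
sed -i 's/^assembler=megahit/assembler=spades/' ShotgunMG.ini
sed -i 's/^mapper=bbmap/mapper=bwa/' ShotgunMG.ini
sed -i 's/^binner=metabat2/binner=metabat2:maxbin2/' ShotgunMG.ini
```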
Next are database paths:
[DB]
#if METAGENOME
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
#if METATRANSCRIPTOME
#contaminants=$INSTALL_HOME/databases/RDP_training_sets/greengenes/2014-08-11/gg_13_5_withTaxonomy_corrected_silva_r117_euk_mito_chloro_sortedlength_99pcid_Illumina_contaminants.fasta
And next are all the parameters of each job generated by the pipeline. For instance, we have the following values under the [trim] and [bbduk] sections:
[trim]
threads=6
trailing_min_quality=30
average_quality=30
min_length=45
# Have a look at the fastqc profiles to make sure you are cropping enough bases at the beginning of the reads.
headcrop=16
# crop is optional. If for some reason you are still stuck with overrepresented kmers at the end, use crop=<in> to force them off.
crop=96
adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa
illumina_clip_settings=:2:10:10
cluster_walltime=-t 6:00:0
cluster_cpu=-N 1 -n 6
cluster_pmem=--mem=32000
.
.
.
Important parameters
The most important parameters for the pipeline are listed below. These parameters should always be double-checked before running the pipeline. Key parameters under the [DEFAULT] and [DB] categories are discussed above, but are relisted here. Special attention should also be paid to the [trim] parameters, as accurately QCed reads are paramount to achieving a good assembly and, consequently, accurate downstream results. It is strongly encouraged to perform a FastQC analysis on at least one sequencing library before and after QC to make sure that reads going into the co-assembly are of good quality (see the sketch after the [DB] listing below).
[DB]
#if METAGENOME
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
#if METATRANSCRIPTOME
#contaminants=$INSTALL_HOME/databases/RDP_training_sets/greengenes/2014-08-11/gg_13_5_withTaxonomy_corrected_silva_r117_euk_mito_chloro_sortedlength_99pcid_Illumina_contaminants.fasta
phix=$INSTALL_HOME/databases/contaminants/phix174_NC_001422.1.fasta
## If Bacteria and archea
core=$INSTALL_HOME/databases/fasta/greengenes/core_set_aligned.fasta
rdp_training_set=$INSTALL_HOME/databases/RDP_training_sets/silva/2017-07-12/specie/rRNAClassifier.properties
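The before/after FastQC check mentioned above might look like the following sketch (it assumes FastQC is installed or otherwise available on your system; FastQC is not part of the module list given earlier):
```bash
# Compare read quality before and after the trim/remove_contam steps
mkdir -p fastqc_raw fastqc_qced
fastqc -o fastqc_raw raw_reads/Microbial_Mock_Community_Even_A_R1.fastq.gz
fastqc -o fastqc_qced qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.ncontam_paired_R1.fastq.gz
# Open the resulting .html reports in a browser and inspect adapter content and per-base quality
```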
Jobs and their dependencies
One key feature of ShotgunMG (through the underlying GenPipes engine) is its ability to generate jobs and manage the dependency network of each job. If we look at the first two steps (two jobs) written to the commands.sh file, we see the following instructions:
#-------------------------------------------------------------------------------
# STEP: trim
#-------------------------------------------------------------------------------
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP
#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
/project/6008026/projects/misc/shotgunmg_paper/mock_community/./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz /project/6008026/projects/misc/shotgunmg_paper/mock_community/./raw_reads/Microbial_Mock_Community_Staggered_B_R2
qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Moc
ILLUMINACLIP:/project/6008026/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=rrg-jtrembla --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0 --mem=32000 -N 1 -n 6 | grep -o "[0-9]*")
echo "$trim_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
#-------------------------------------------------------------------------------
# STEP: remove_contam
#-------------------------------------------------------------------------------
STEP=remove_contam
mkdir -p $JOB_OUTPUT_DIR/$STEP
#-------------------------------------------------------------------------------
# JOB: remove_contam_1_JOB_ID: bbduk_paired_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=bbduk_paired_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=$trim_1_JOB_ID
JOB_DONE=job_output/remove_contam/bbduk_paired_Microbial_Mock_Community_Staggered_B.3da67d954e46b705fe615ca301d3b78a.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
remove_contam_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/bbmap/38.11 && \
bbduk.sh \
in=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz \
in2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz \
out=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz \
out2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz \
outm=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R1.fastq \
outm2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R2.fastq \
stats=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.duk_contam_interleaved_log.txt \
k=21 \
minkmerhits=1 \
ref=/project/6008026/databases/contaminants/Illumina.artifacts.fa \
overwrite=true \
threads=1
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=rrg-jtrembla --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 12:00:0 --mem=12000 -N 1 -n 1 -d afterok:$JOB_DEPENDENCIES | grep -o "[0-9]*")
echo "$remove_contam_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
The first step of the pipeline, STEP: trim, contains a single job called trimmomatic_Microbial_Mock_Community_Staggered_B, which runs trimmomatic with the proper parameters. This job has no dependency: the JOB_DEPENDENCIES= variable is not assigned any value and no -d afterok:$JOB_DEPENDENCIES argument is passed to the sbatch command. This means that this job will be executed as soon as the job scheduler finds appropriate resources for it. In this particular case, we are asking for 32 GB of RAM (--mem=32000), a walltime of 6 hrs (-t 6:00:0), and 1 compute node with 6 cores (-N 1 -n 6). These values are specified in the ShotgunMG.ini file under the [trim] section.
The second step of the pipeline, STEP: remove_contam, contains one job labeled bbduk_paired_Microbial_Mock_Community_Staggered_B, which depends on the successful execution of the trimmomatic_Microbial_Mock_Community_Staggered_B job. In this case, the JOB_DEPENDENCIES variable is assigned the $trim_1_JOB_ID value (JOB_DEPENDENCIES=$trim_1_JOB_ID) and the sbatch command contains the -d afterok:$JOB_DEPENDENCIES argument. This means that SLURM will wait for $trim_1_JOB_ID to complete successfully (i.e. exit_status=0) before releasing the bbduk_paired_Microbial_Mock_Community_Staggered_B job from the waiting queue.
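Once commands.sh has been executed and the jobs submitted, their state and dependencies can be inspected with standard SLURM tools (a sketch; squeue and sacct are generic SLURM client commands, not part of ShotgunMG itself):
```bash
# Show your pending/running jobs, including those still held by a dependency
squeue -u $USER
# Inspect the final state and exit code of a specific job once it has finished
sacct -j <job_id> --format=JobID,JobName%40,State,ExitCode,Elapsed
```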
How are these jobs generated in the shotgunmg.py script?
The script pipelines/metagenomics/shotgunmg.py contains the narrative of the pipeline and creates Job objects relying on the bio/shotgun_metagenomics.py and bio/microbial_ecology.py Python libraries. For instance, the first trimmomatic_Microbial_Mock_Community_Staggered_B job mentioned above is written on lines 77-109, inside the trim step:
```python
def trim(self):
    """
    Step trim(): Raw fastqs will be trimmed using Trimmomatic. Interleaved fastqs will be generated after trimming.
    """
    mkdir_p(os.path.join(self._root_dir, "qced_reads"))
    jobs = []
    trim_stats = []

    # The readset file was provided as an argument to shotgunmg.py; one line = one readset.
    # The for loop will iterate over all readsets and generate one trim job per readset.
    for readset in self.readsets:
        trim_file_prefix = os.path.join("qced_reads", readset.sample.name, readset.name + ".trim.")
        trim_file_prefix_long = os.path.join("qced_reads", readset.sample.name, readset.name + ".trim.long.")
        if not os.path.exists(os.path.join("qced_reads", readset.sample.name)):
            os.makedirs(os.path.join("qced_reads", readset.sample.name))

        if readset.run_type == "PAIRED_END":  # If, in the readset, the run type is PAIRED_END (the majority of cases)...
            # ...call the trimmomatic function inside the bio/shotgun_metagenomics.py library.
            job = shotgun_metagenomics.trimmomatic(
                readset.fastq1,
                readset.fastq2,
                trim_file_prefix + "pair1.fastq.gz",
                trim_file_prefix + "single1.fastq.gz",
                trim_file_prefix + "pair2.fastq.gz",
                trim_file_prefix + "single2.fastq.gz",
                readset.quality_offset,
                trim_file_prefix + "out",
                trim_file_prefix + "stats.csv"
            )
            job.name = "trimmomatic_" + readset.sample.name
            job.subname = "trim"
            jobs.append(job)
```
The above excerpt is just an example with one readset but, in reality, the trim step will generate one job per readset - so many more jobs. This Job object is instantiated by calling the trimmomatic function inside the bio/shotgun_metagenomics.py Python library, which is described below:
def trimmomatic(input1, input2, paired_output1, unpaired_output1, paired_output2, unpaired_output2, quality_offset, trim_log, trim_stats):

    job = Job(
        # Input files - essential to specify for the job dependency network
        [input1, input2],
        # Output files - essential to specify for the job dependency network. These outputs will be used as inputs for the next job definition.
        [paired_output1, unpaired_output1, paired_output2, unpaired_output2, trim_log, trim_stats],
        [
            # Modules are defined in the .ini file. In the current case, we ask for module_java and module_trimmomatic (defined in the .ini file) to be loaded.
            ['java', 'module_java'],
            ['trimmomatic', 'module_trimmomatic']
        ]
    )

    # config.param() fetches values defined in the .ini file, here within the [trim] category.
    threads = config.param('trim', 'threads', type='posint')                  # threads=6 in the .ini file
    adapter_file = config.param('trim', 'adapter_fasta', type='filepath')     # adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa in the .ini file
    illumina_clip_settings = config.param('trim', 'illumina_clip_settings')
    trailing_min_quality = config.param('trim', 'trailing_min_quality', type='int')
    min_length = config.param('trim', 'min_length', type='posint')
    headcrop = config.param('trim', 'headcrop', required=False, type='int')
    crop = config.param('trim', 'crop', required=False, type='int')
    sliding_window1 = config.param('trim', 'sliding_window1', required=False, type='int')
    sliding_window2 = config.param('trim', 'sliding_window2', required=False, type='int')

    # Check if sliding_window1/2 are defined in the .ini file; if not, assign default values.
    if not isinstance(sliding_window1, int):
        if not (sliding_window1 and sliding_window1.strip()):
            sliding_window1 = 4
    if not isinstance(sliding_window2, int):
        if not (sliding_window2 and sliding_window2.strip()):
            sliding_window2 = 15

    # Populate the trimmomatic software command itself.
    job.command = """
java -XX:ParallelGCThreads={threads} -Xmx2G -jar \$TRIMMOMATIC_JAR {mode} \\
  -threads {threads} \\
  -phred{quality_offset} \\
  {input1} {input2} \\
  {paired_output1} {unpaired_output1} {paired_output2} {unpaired_output2} \\
  ILLUMINACLIP:{adapter_file}{illumina_clip_settings} \\
  TRAILING:{trailing_min_quality} \\
  SLIDINGWINDOW:{sliding_window1}:{sliding_window2} \\
  MINLEN:{min_length} \\
  HEADCROP:{headcrop}""".format(
        mode = "PE",
        threads = threads,
        quality_offset = quality_offset,
        input1 = input1,
        input2 = input2,
        paired_output1 = paired_output1,
        paired_output2 = paired_output2,
        unpaired_output1 = unpaired_output1,
        unpaired_output2 = unpaired_output2,
        adapter_file = adapter_file,
        illumina_clip_settings = illumina_clip_settings,
        trailing_min_quality = trailing_min_quality,
        min_length = min_length,
        sliding_window1 = sliding_window1,
        sliding_window2 = sliding_window2,
        headcrop = str(headcrop)
    )

    # If crop is defined in the .ini file, we add it to the job.command string/attribute.
    if isinstance(crop, int):
        job.command += """ CROP:{crop}""".format(crop = str(crop))
    job.command += " \\\n 2> " + trim_log

    # Compute statistics
    job.command += " && \\\ngrep ^Input " + trim_log + " | perl -pe 's/^Input Read Pairs: (\\d+).*Both Surviving: (\\d+).*Forward Only Surviving: (\\d+).*$/Raw Fragments,\\1\\nFragment Surviving,\\2\\nSingle Surviving,\\3/' > " + trim_stats

    # Finally, return the Job object.
    return job
The 9 arguments passed to the trimmomatic function are the R1 and R2 fastq.gz files, followed by the 4 output files that trimmomatic will generate. We then specify the quality offset (usually phred 33), one log file and one file path that will contain the trimming statistics. These 9 arguments are used to populate the trimmomatic job's arguments. It is critical to specify the inputs and outputs of each job, as these values are used by the WMS to manage job dependencies when generating jobs for the SLURM or SGE scheduler. In the case of this particular job, once the trimmomatic command successfully executes and returns an exit status of zero, 6 files will have been generated: 4 fastq.gz + 1 log + 1 statistics file. Altogether, these last-touched files represent the output files defined in the Job object definition; this is important since they are also used as inputs for the next job, bbduk_paired_<readset_name>, so that the dependency network can be properly built.
Summary
To summarize (Figure 2), the main pipeline workflow is defined in shotgunmg.py, where each Job object is instantiated and defined in either bio/shotgun_metagenomics.py or bio/microbial_ecology.py. The customizable parameters used in the job definitions of these two Python libraries are set in the .ini file (ShotgunMG.ini).
2- Workflow walkthrough
Since a complete run of the pipeline can generate hundreds of jobs, we will focus on the jobs of steps 1 to 3 (trim, remove_contam, assembly), then step 5 (abundance) and finally step 18 (binning). In the next section, we will go through these selected steps of the ShotgunMG pipeline and describe the results generated in each of them. For this example, the pipeline was run with paired-end fastqs from a bacterial mock community.
Description of results generated by each job
For the sake of simplicity, only the commands themselves will be shown. For more details, the reader is referred to the complete commands.sh file example included in the supplementary material of the ShotgunMG website: jtremblay.github.io/shotgunmg.html.
Step 1 - trim
This step runs trimmomatic to remove sequencing adapters from each read and to perform quality control, discarding low-quality reads. Particular emphasis should be put on appropriately head- and tail-cropping the reads. Running FastQC on reads before and after trimming should be done to validate the trimming step.
JOB: trimmomatic_<readset_name>
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz ./raw_reads/Microbial_Mock_Community_Staggered_B_R2.fastq.gz \
qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single2.fastq.gz \
ILLUMINACLIP:$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.stats.csv
NRC_STATE=\$PIPESTATUS
Purpose: To perform quality control, removing low-quality reads and trimming sequencing adapters.
.ini:
[trim]
threads=6
trailing_min_quality=30
average_quality=30
min_length=45
# Have a look at the fastqc profiles to make sure you are cropping enough bases at the beginning of the reads.
headcrop=16
# crop is optional. If for some reason you are still stuck with overrepresented kmers at the end, use crop=<in> to force them off.
crop=96
adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa
illumina_clip_settings=:2:10:10
cluster_walltime=-t 6:00:0
cluster_cpu=-N 1 -n 6
cluster_pmem=--mem=32000
Step 2- remove_contam
This step invokes bbduk (bbtools) to remove known sequencing contaminants and PhiX reads from the QCed reads obtained in the trim step. The remaining reads (./ShotgunMG/duk/ncontam_nphix.fastq) will be used for the downstream jobs.
JOB: bbduk_paired_<readset_name>
module load nrc/bbmap/38.11 && \
bbduk.sh \
in=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz \
in2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz \
out=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz \
out2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz \
outm=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R1.fastq \
outm2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R2.fastq \
stats=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.duk_contam_interleaved_log.txt \
k=21 \
minkmerhits=1 \
ref=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa \
overwrite=true \
threads=1
Purpose:
To scan fastq libraries for known sequencing contaminants and remove them.
.ini:
[DEFAULT]
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
[bbduk]
k=21
c=1
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 1
cluster_pmem=--mem=12000
Step 3- assembly
For the sake of simplicity, the two steps above only showed command examples for one readset. Here, for the co-assembly job, we show the command with the full set of readsets (6 readsets). Assembly can be done using either SPAdes or MEGAHIT. MEGAHIT is much faster than SPAdes and has often proven to be the only realistic option for assembling large sequencing datasets.
JOB: assembly
module load nrc/megahit/1.2.9 && \
rm assembly -rf && \
megahit -t 32 --k-min 31 --k-max 131 --k-step 10 \
--min-contig-len 1000 \
-1 qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_C/Microbial_Mock_Community_Even_C.ncontam_paired_R1.fastq.gz,qced_reads/Microbia
l_Mock_Community_Staggered_A/Microbial_Mock_Community_Staggered_A.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_C/Microbial_Mock_Community_Staggered_C.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_B/Microbial_Mock_Community_Even_B.ncontam_paired_R1.fastq.gz \
-2 qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_C/Microbial_Mock_Community_Even_C.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_A/Microbial_Mock_Community_Staggered_A.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_C/Microbial_Mock_Community_Staggered_C.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_B/Microbial_Mock_Community_Even_B.ncontam_paired_R2.fastq.gz \
--out-dir assembly && \
sed -i 's/^>\\(\\S\\+\\) .*/>\\1/' assembly/final.contigs.fa && \
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/
Purpose:
To generate assembled contigs from short reads data.
.ini:
[megahit]
# Megahit will use 0.9 pc of the node's memory by default.
min_contig_length=1000
num_threads=32
kmin=31
kmax=131
kstep=10
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 32
cluster_pmem=--mem=514500
Step 4- gene_prediction
Use Prodigal to predict ORFs on contigs
Step 5- abundance
Map qced reads on contigs + generate contigs and gene abundance matrices.
Step 6- exonerate
Split the co-assembly and predicted gene fasta files into smaller chunks for the downstream annotation steps.
Step 7- kegg_annotation
Attempt to assign a KO to each predicted gene using kofamscan or hmmsearch on the kofam database. If you have access to the private KEGG GENES database, diamond blastp can also be used to assign KOs. The default method here is hmmsearch_kofam.
Step 8- rpsblast_cog
RPSBLAST of predicted genes vs COG db.
Step 9- rpsblast_kog
RPSBLAST of predicted genes vs KOG db.
Step 10- hmmscan_pfam
HMMSCAN of predicted genes vs PFAM-A DB.
Step 11- diamond_blastp_nr
DIAMOND BLASTp of predicted genes vs NCBI nr.
Step 12- ncrna
Scan for non-coding RNAs in contigs: rRNA and tRNA sequences.
Step 13- taxonomic_annotation
Using DIAMOND blastp results generated in the diamond_blastp vs nr step, each contig will be assigned a taxonomic lineage using the CAT package. Taxonomic summaries and figures are then generated.
Step 14- beta_diversity
Compute beta-diversity on contig abundance and gene abundance matrices.
Step 15- alpha_diversity
Compute alpha-diversity (RTK package) on contig abundance and gene abundance matrices.
Step 16- overrepresentation
From the results of the KO and COG annotations (steps 7 and 8), abundance matrices of all KOs and COGs will be generated. For each ortholog type, two matrices will be generated: one using the normalized (CPM) gene abundance matrix (./gene_abundance/merged_abundance_cpm.tsv) and the other using the read count (i.e. unnormalized) matrix (./gene_abundance/merged_abundance.tsv). Here, all genes pointing to the same ortholog are summarized so that the resulting ortholog abundance matrices contain the aggregated values of all gene abundance values assigned to that ortholog.
Step 17- finalize
Merge all annotations data generated in steps 7-11 and step 13 in a single gff and tsv file.
Step 18- binning
Generates MAGs/bins with the metabat2 and/or maxbin2 software.
3- Smart restart mechanism
One very useful feature of ShotgunMG (and GenPipes-based pipelines) is its built-in smart-restart mechanism. For instance, let's generate jobs for the first step of the pipeline by running the following command:
shotgunmg.py -c ./ShotgunMG.ini -s 1 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh
and let's have a look at the job definition:
#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz ./raw_reads/Microbial_Mock_Community_Staggered_B_R2.fastq.gz \
qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single2.fastq.gz \
ILLUMINACLIP:$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.stats.csv
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=my-account --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0 --mem=32000 -N 1 -n 6 | grep -o "[0-9]*")
echo "$trim_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
The value stored in the JOB_DONE variable (JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done) is actually an md5 signature of the files' creation dates, names and sizes, combined with all the job's parameters used to generate the job. Once the job has executed properly on the compute cluster and an exit_status=0 is returned, the file defined by the JOB_DONE variable is touched (i.e. created) (if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi).
Once this job has successfully completed, if we decide to run the first step of the pipeline again (shotgunmg.py -c ./ShotgunMG.ini -s 1 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh), we'll find that no job was written to the standard output / commands.sh file. This is because, during the job generation process, the pipeline detected that this job was up to date:
1) the infiles (fastq.gz files) of the job were older than the outfiles (.pair1.fastq.gz, .pair2.fastq.gz, .single1.fastq.gz, .single2.fastq.gz) it generated,
2) no changes whatsoever were made to the input files and, in consequence,
3) the file sizes of the input files remained the same, and
4) there were no modifications to the parameters in the .ini file under the [trim] section.
Overall, the md5 signature is identical between the executed job and the new Job object and, in consequence, no new job is created. If we had modified the header of one of the reads in one of the fastq files and rerun the pipeline, this job would have been re-generated, because the pipeline would have found differences between the old and the new md5-signed .done file. This mechanism is useful if, for some reason, we realize that one of the parameters given to trimmomatic was not optimal. For instance, if we increased the crop value from 96 to 100, upon re-execution of shotgunmg.py all the trimmomatic jobs would be regenerated, because the md5 signature of the old .done file would no longer match the new md5 signature of the new job - in other words, the job would not be up to date anymore and would consequently be re-generated. This mechanism is also useful in the context of random job failures caused by a faulty compute node or (as described above) when the pipeline operator wants to modify one or more variables of a given step of the pipeline.
Another instance where this mechanism is useful is when a job crashes because we didn't ask for enough memory or cores for its execution: SLURM detects this and kills the job because the limits of the resources we asked for were exceeded. When that happens, an exit_status other than 0 is returned for the offending job, and the rest of the downstream dependent jobs are then killed by SLURM. Instead of manually going through every single job to find which one returned a non-zero exit_status, the pipeline operator can simply rerun shotgunmg.py with the same parameters and see which jobs are regenerated (see the sketch below).
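In practice, this might look like the following sketch (file names follow the examples above; grep is only used to count the job definitions that would be regenerated):
```bash
# Re-generate jobs for steps 1-5 after a failure; only jobs that are not up to date are written out
shotgunmg.py -c ./ShotgunMG.ini -s 1-5 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh
# Count how many jobs would be (re)submitted
grep -c "^# JOB:" ./commands.sh
# The .done files of successfully completed jobs are kept under job_output/<step>/
ls job_output/trim/*.done
```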
4- End results
Arguably the most important results generated by the pipeline are the following files:
Read counts
Contains read count summary throughout the different steps of the pipeline.
contig_abundance/qc_mappping_stats.tsv
Co-assembly
Co-assembly fasta file:
assembly/Contigs.fasta
Statistics and profiles of co-assembly.
assembly/assembly_stats.txt
assembly/quast/report.html
Gene prediction
Predicted genes in fna, faa and gff formats:
gene_prediction/Contigs_renamed.fna
gene_prediction/Contigs_renamed.faa
gene_prediction/Contigs_renamed.gff
Contigs and genes abundance matrices
Text files in .tsv format. Unnormalized (absolute number of QCed reads that properly aligned to genes or contigs):
gene_abundance/merged_gene_abundance.tsv
contig_abundance/merged_contig_abundance.tsv
Normalized abundance - computed with the edgeR package using the TMM method - giving Counts Per Million (CPM):
gene_abundance/merged_gene_abundance_cpm.tsv
contig_abundance/merged_contig_abundance_cpm.tsv
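For a quick look at one of these matrices, something like the following can be used (a sketch; cut and head are standard tools, and the number of columns depends on the number of samples):
```bash
# Preview the first rows and first five columns of the normalized gene abundance matrix
head -n 5 gene_abundance/merged_gene_abundance_cpm.tsv | cut -f 1-5
```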
Annotations
All merged annotations (except ncrna) combined in a .tsv file
annotations/annotations.tsv
Alpha diversity
Contains the alpha diversity values. Observed contigs/genes (richness), shannon, etc.
alpha_div/contig_abundance/alpha_richness.tsv
alpha_div/contig_abundance/alpha_shannon.tsv
alpha_div/gene_abundance/alpha_richness.tsv
alpha_div/gene_abundance/alpha_shannon.tsv
alpha_div/rpob/alpha_richness.tsv
alpha_div/rpob/alpha_shannon.tsv
Beta diversity
Contains the beta diversity (Bray-Curtis dissimilarity) distance matrices, PCoA ordinations and interactive 3D figures (which can be opened in most web browsers) computed from the contig and gene abundance matrices. Beta diversity metrics are generated using microbiomeutils (https://github.com/jtremblay/microbiomeutils)
betadiv/bray_curtis_contig_bacteriaArchaea/3d_bray_curtis_plot/index.html
betadiv/bray_curtis_contig_abundance/3d_bray_curtis_plot/index.html
betadiv/bray_curtis_gene_abundance/3d_bray_curtis_plot/index.html
Distance matrices.
betadiv/bray_curtis_contig_abundance/bray_curtis_contig_abundance.tsv
betadiv/bray_curtis_contig_bacteriaArchaea/bray_curtis_contig_bacteriaArchaea.tsv
betadiv/bray_curtis_gene_abundance/bray_curtis_gene_abundance.tsv
Principal coordinates analysis tables computed from the distance matrices described above.
betadiv/bray_curtis_contig_abundance/bray_curtis_contig_abundance_coords.tsv
betadiv/bray_curtis_contig_bacteriaArchaea/bray_curtis_contig_bacteriaArchaea_coords.tsv
betadiv/bray_curtis_gene_abundance/bray_curtis_gene_abundance_coords.tsv
DIAMOND BLASTp output of genes vs KEGG genes DB or kofamscan results
# Diamond blastp vs KEGG GENES db (if access to KEGG license)
annotations/blastp_kegg.tsv
annotations/KOs_parsed.tsv
# kofamscan (free - no license required)
annotations/kofamscan.tsv
annotations/KOs_parsed.tsv
# hmmsearch kofam (free - no license required)
annotations/hmmsearch_kofam.tsv
annotations/KOs_parsed.tsv
DIAMOND BLASTp output of genes vs NCBI nr DB
annotations/blastp_nr.tsv
hmmscan output of genes vs PFAM-A DB.
Three different types of output are provided. Please consult HMMer 3 documentation for more details.
annotations/hmmscan_pfam_domtblout.tsv
annotations/hmmscan_pfam_pfamtblout.tsv
annotations/hmmscan_pfam_tblout.tsv
tRNA (tRNAscan) and rRNA (barrnap) abundance/annotations
# tRNA
annotations/trna/trna*tsv
# rRNA
annotations/rrna/rdp/*
annotations/rrna/barrnap/*
annotations/rrna/abundance/*
annotations/rrna/abundance/arc_SSU_merged_abundance.tsv
annotations/rrna/abundance/bac_LSU_merged_abundance.tsv
annotations/rrna/abundance/bac_SSU_merged_abundance.tsv
annotations/rrna/abundance/euk_LSU_merged_abundance.tsv
annotations/rrna/abundance/euk_SSU_merged_abundance.tsv
Taxonomic profiles
Taxonomic profile tables at L1=kingdom, L2=phylum, L3=class, L4=order, L5=family and L6=genus, in both absolute (consensus rarefaction) and relative abundance.
# All organisms
annotations/taxonomy_consensus/all/absolute/
annotations/taxonomy_consensus/all/relative/
# Bacteria/Archaea only
annotations/taxonomy_consensus/bacteriaArchaea/absolute/
annotations/taxonomy_consensus/bacteriaArchaea/relative/
# Non-bacterial/archaeal:
annotations/taxonomy_consensus/others/absolute/
annotations/taxonomy_consensus/others/relative/
5- Job output folder
When a job is launched/submitted, whether on the job scheduler or directly in the terminal with the batch option, the standard output and error are redirected and written to the ./job_output/ folder, in the subfolder of the step this job belongs to. For instance, consider the following job, where we have a JOB_OUTPUT_DIR set to ./job_output. The trim step (step 1) will generate a subdirectory:
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP
For instance, if we look at the trimmomatic_Microbial_Mock_Community_Even_A job, we have the following instructions, which define the file that will hold the standard output and error:
JOB_NAME=trimmomatic_Microbial_Mock_Community_Even_A
JOB_DEPENDENCIES=
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
Finally, when the job is submitted to the job scheduler with the following command:
sbatch --account=my_account_xyz --export=ALL --mail-type=END --mail-user=$JOB_MAIL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0 --mem=32000 -N 1 -n 6
the -o $JOB_OUTPUT argument instructs the job scheduler where to write the standard output and error (see the sketch at the end of this section for a quick way to inspect these files).
Here is the complete chunk of instructions:
OUTPUT_DIR=/fs/hnas1-evs1/Dgrdi/grdi_eco/training/shotgun_metagenomics_20201125/mock_community_shotgunMG/megahit_bwa
JOB_OUTPUT_DIR=$OUTPUT_DIR/job_output
TIMESTAMP=`date +%FT%H.%M.%S`
JOB_LIST=$JOB_OUTPUT_DIR/Metagenomics_job_list_$TIMESTAMP
mkdir -p $OUTPUT_DIR
cd $OUTPUT_DIR
#-------------------------------------------------------------------------------
# STEP: trim
#-------------------------------------------------------------------------------
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP
#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Even_A
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Even_A
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Even_A.e8414ede7cfd9241c50df6291e44ab89.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_171 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
/fs/hnas1-evs1/Dgrdi/grdi_eco/training/shotgun_metagenomics_20201125/mock_community_shotgunMG/megahit_bwa/./raw_reads//Microbial_Mock_Community_Even_A_R1.fastq.gz /fs/
qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even
ILLUMINACLIP:/space/project/grdi/eco/bioinfo-tools/nrc/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward On
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
jobsub -P nrc_eme -V -j y -l res_image=nrc/nrc_all_default_ubuntu-18.04-amd64_latest -wd $OUTPUT_DIR -o $JOB_OUTPUT -N $JOB_NAME -l h_rt=6:00:0 -l res_mem=32000 -pe dev
echo "$trim_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
6- Integrating more options in a given bioinformatics package
One useful feature of ShotgunMG (and the GenPipes WMS) is the possibility of customizing each step of a workflow. For instance, if we wanted to add an option for Megahit, we would need to add a few lines of code to the megahit_R1_R2 function found in the bio/shotgun_metagenomics.py file. As a reminder, the megahit_R1_R2 function is called from the main pipeline script/wrapper shotgunmg.py:
job = shotgun_metagenomics.megahit_R1_R2(
    fastq_gz_list_R1,
    fastq_gz_list_R2,
    os.path.join("assembly"),
    reads_type
)
job.name = "megahit"
job.subname = "megahit"
jobs.append(job)
where a Job object will be instantiated in the megahit_R1_R2 function of the bio/shotgun_metagenomics.py library:
def megahit_R1_R2(infiles_R1, infiles_R2, outdir, type):
    if(type == "pe"):
        infiles_string_R1 = "-1 " + ",".join(infiles_R1)
        infiles_string_R2 = "-2 " + ",".join(infiles_R2)

    job = Job(
        infiles_R1 + infiles_R2,
        [os.path.join(outdir, "Contigs.fasta"), os.path.join(outdir, "final.contigs.fa")],
        [
            ['megahit', 'module_megahit']
        ]
    )
    job.command = """
rm {outdir} -rf && \\
megahit -t {num_threads} --k-min {kmin} --k-max {kmax} --k-step {kstep} \\
  --min-contig-len {min_contig_length} \\
  {infiles_R1} \\
  {infiles_R2} \\
  --out-dir {outdir} && \\
sed -i 's/^>\\\(\\\S\\\+\\\) .*/>\\\\1/' final.contigs.fa && \\
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/""".format(
        min_contig_length = config.param("megahit", "min_contig_length", 1, "posint"),
        num_threads = config.param("megahit", "num_threads", 1, "posint"),
        kmin = config.param("megahit", "kmin", 1, "posint"),
        kmax = config.param("megahit", "kmax", 1, "posint"),
        kstep = config.param("megahit", "kstep", 1, "int"),
        infiles_R1 = infiles_string_R1,
        infiles_R2 = infiles_string_R2,
        outdir = outdir
    )
    return job
In the current implementation, we only included 5 parameters specific to Megahit (min_contig_length, num_threads, kmin, kmax and kstep). If, for instance, we realize at some point that it would be useful to implement support for Megahit's --prune-level <int> option, we would need to add a line reading this option from the configuration in the megahit_R1_R2 function and pass it to the megahit command:
def megahit_R1_R2(infiles_R1, infiles_R2, outdir, type):
    if(type == "pe"):
        infiles_string_R1 = "-1 " + ",".join(infiles_R1)
        infiles_string_R2 = "-2 " + ",".join(infiles_R2)

    job = Job(
        infiles_R1 + infiles_R2,
        [os.path.join(outdir, "Contigs.fasta"), os.path.join(outdir, "final.contigs.fa")],
        [
            ['megahit', 'module_megahit']
        ]
    )
    job.command = """
rm {outdir} -rf && \\
megahit -t {num_threads} --k-min {kmin} --k-max {kmax} --k-step {kstep} --prune-level {prune_level} \\
  --min-contig-len {min_contig_length} \\
  {infiles_R1} \\
  {infiles_R2} \\
  --out-dir {outdir} && \\
sed -i 's/^>\\\(\\\S\\\+\\\) .*/>\\\\1/' assembly/final.contigs.fa && \\
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/""".format(
        min_contig_length = config.param("megahit", "min_contig_length", 1, "posint"),
        num_threads = config.param("megahit", "num_threads", 1, "posint"),
        kmin = config.param("megahit", "kmin", 1, "posint"),
        kmax = config.param("megahit", "kmax", 1, "posint"),
        kstep = config.param("megahit", "kstep", 1, "int"),
        prune_level = config.param("megahit", "prune_level", 1, "int"), #!! <--- Here we added the new option (and passed --prune-level {prune_level} to the megahit command above)
        infiles_R1 = infiles_string_R1,
        infiles_R2 = infiles_string_R2,
        outdir = outdir
    )
    return job
We would also need to add the prune_level option in the .ini file under the [megahit] category:
[megahit]
# Megahit will use 0.9 pc of the node's memory by default.
min_contig_length=1000
num_threads=32
kmin=31
kmax=131
kstep=10
prune_level=3
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 32
cluster_pmem=--mem=514500
We should now be ready to go with this new option. Adding an option for Megahit is a simple case, as it consists of adding a single extra argument to the megahit command.
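To verify that the new option is picked up, the assembly step's commands can be regenerated and searched for the flag (a sketch; it assumes the modifications above were made to bio/shotgun_metagenomics.py and ShotgunMG.ini):
```bash
# Regenerate the assembly step (step 3) and confirm the new megahit flag appears in the generated command
shotgunmg.py -c ./ShotgunMG.ini -s 3 -o . -j batch -z ./pipeline.json > ./commands.sh
grep -- "--prune-level" ./commands.sh
```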