ShotgunMG v1.3.0 user guide
Content
This document is divided into the following sections:
1) Global overview
2) Workflow walkthrough
3) Smart restart mechanism
4) End results
5) Information regarding the job output folder where standard output and standard error of each job are written
6) Integrating more options in a given bioinformatics package
All files being referred to in this document can be found in the companion docker image: https://cloud.docker.com/u/julio514/repository/docker/julio514/centos (Not yet available, TBD) The first step should be to install this image on your computer.
1- Getting started
Global overview
ShotgunMG is a bioinformatics pipeline geared to process short reads metagenomic sequencing data on High Performance Computing (HPC) environments. It is intended for experienced bioinformaticians with experience on a Linux-based OS (i.e. command line) and a HPC environment, including interacting with a job scheduler, in our case, SLURM or SGE. The pipeline relies on the GenPipes Workflow Management System (WMS) (https://doi.org/10.1093/gigascience/giz037) and requires the 3 following elements in order to be functional:
1) The pipeline itself (nrc_pipeline_public): https://bitbucket.org/jtremblay514/nrc_pipeline_public/src/master/
2) A set of heteregenous Perl and R scripts (nrc_tools_public): https://bitbucket.org/jtremblay514/nrc_tools_public/src/master/
3) An extensive set of third-party software listed below and the versions used and tested with the pipeline. Installation details of each of these packages is also available (nrc_resources_public): https://bitbucket.org/jtremblay514/nrc_resources_public/src/master/
List of third party software:
trimmomatic/0.39,megahit/1.2.9,metabat/2.12.1,perl/5.26.0,bbmap/38.11,python/2.7.5,python/3.6.5,python/3.9.0,R/3.6.0,java/jdk1.8.0_144,blast/2.10.1+,bedtools/2.23,barrnap/0.9,hmmer/3.1b2,metabat/2.12.1,maxbin/2.7,microbiomeutils/0.9
Installation of all the required libraries and external packages is a time consuming task and for the purpose of this guide, we’ll refer the reader to the CentOS 7 image that we provide, which includes all the necessary packages needed to properly execute ShotgunMG.
The pipeline has been written in such a way that the user can use various software packages in the workflow. The pipeline supports two shotgun metagenome assemblers (Megahit and Spades), two reads mapper (bwa-mem and bbmap) and two metagenome binning methods (MetaBAT and MaxBin).
Setting up the databases
The databases needed by ShotgunMG were not included in the Docker image, because of size constraints. The following databases should be manually installed:
CAT
Go here - https://github.com/dutilh/CAT - and follow the instructions under the preconstructed databases section,
CheckM
Follow the instructions : https://github.com/Ecogenomics/CheckM
NCBI nr
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz (Do not forget to run diamond makedb --in nr -p 4 --db nr.dmnd
once downloaded).
PFAM-A
Available here: http://ftp.ebi.ac.uk/pub/databases/Pfam. Use the latest version. May have to run hmmpress
once downloaded.
Contaminants sequences
Available here: http://jtremblay.github.io/files/contaminants.tar.gz. Contains known Illumina contaminants, other sequencing artefacts and adapter sequences from various kits (NexteraXT, TruSeq, etc.). In fasta format.
Setting up files needed by the pipeline
The first step in running ShotgunMG is to setup the files required by the pipeline to run. Fastq libraries usually come in the form of demultiplexed paired end sequencing libraries - one library per sample. These .fastq.gz
files should be stored in a directory labeled raw_reads/
. The ShotgunMG.ini
and readset.tsv
files are needed as well (more on this below).
Once the docker image is properly installed and loaded, we can go to the following directory
cd $INSTALL_HOME
root@centos7image:/project/microbiome_genomics$ pwd
/project/microbiome_genomics
Next we navigate to the projects directory where we have two directories:
>cd projects
>ls -l
mock_community
mock_community_processed
The folder mock_community_processed
contains and example of the complete execution of ShotgunMG on mock cummunity samples (HM-782D and HM-783D, BEI resources, Manassas, VA, USA) prepared for shotgun metagenomics libraries (NexteraXT) and sequenced on a HiSeq2500 system on a 2x100 configuration. Resulting fastqs were processed with ShotgunMG using a trimmomatic/megahit/metabat2 workflow. The mock_community
directory contains the raw_reads with other files needed to run the pipeline, but no data processing. In the mock_community
directory, we have the following files and a raw_reads
directory which holds a subset of samples from the AAD project:
root@centos7image:/project/microbiome_genomics/projects$ cd mock_community/
root@centos7image:/project/microbiome_genomics/projects/mock_community$ ls -l
total 52
-rwxrwxr-x 1 root root 12175 Feb 6 20:17 ShotgunMG.ini
drwxrwxr-x 1 root root 4096 Feb 7 20:06 job_output
-rw-r--r-- 1 root root 3632 Feb 4 20:08 mapping_file.tsv
drwxrwsr-x 1 root root 4096 Feb 1 15:45 raw_reads
-rwxr-x--- 1 root root 1261 Feb 6 20:41 workflow.sh
The two inputs we need in order to launch the pipeline are :
1) the raw fastq.gz reads, which should all be stored in the raw_reads/
directory. For paired_end libraries, all files should be labelled *_R1.fastq.gz
and *_R2.fastq.gz
. For single end files, each library should end with _R1.fastq.gz
. No specific nomenclature is expected as long as the file names follow these minimal specifications.
>ls -l raw_reads/
-rw-r--r-- 1 root root 9211851 May 27 20:09 Microbial_Mock_Community_Even_A_R1.fastq.gz
-rw-r--r-- 1 root root 9674881 May 27 20:09 Microbial_Mock_Community_Even_A_R2.fastq.gz
-rw-r--r-- 1 root root 7828069 May 27 20:08 Microbial_Mock_Community_Even_B_R1.fastq.gz
-rw-r--r-- 1 root root 8072334 May 27 20:08 Microbial_Mock_Community_Even_B_R2.fastq.gz
-rw-r--r-- 1 root root 9416489 May 27 20:09 Microbial_Mock_Community_Even_C_R1.fastq.gz
-rw-r--r-- 1 root root 9915486 May 27 20:09 Microbial_Mock_Community_Even_C_R1.fastq.gz
-rw-r--r-- 1 root root 5080506 May 27 20:08 Microbial_Mock_Community_Staggered_A_R1.fastq.gz
-rw-r--r-- 1 root root 5586795 May 27 20:08 Microbial_Mock_Community_Staggered_A_R2.fastq.gz
-rw-r--r-- 1 root root 5080506 May 27 20:08 Microbial_Mock_Community_Staggered_B_R1.fastq.gz
-rw-r--r-- 1 root root 5586795 May 27 20:08 Microbial_Mock_Community_Staggered_B_R2.fastq.gz
-rw-r--r-- 1 root root 5080506 May 27 20:08 Microbial_Mock_Community_Staggered_C_R1.fastq.gz
-rw-r--r-- 1 root root 5586795 May 27 20:08 Microbial_Mock_Community_Staggered_C_R2.fastq.gz
We left traces of the relevant commands in the workflow.sh file which contains the two following commands:
>cat workflow.sh
module load nrc/nrc_tools/1.3
# Preprocessing raw reads
generateReadsetSheet.pl --indir ./raw_reads > readset.tsv
# Run the pipeline. The pipeline should be run in 4 parts: 1) steps 1 to 5. 2) step 6 and 3) steps 7-15. 4) MAGs can then be generated with step 16.
# Steps 1-5
shotgunmg.py -c ./ShotgunMG.ini -s 1-5 -o . -j slurm -b barcodes.fasta -z ./pipeline.json > ./commands.sh
The first command calls the generateReadsetSheet.pl
script and will take all .fastq.gz
files in the raw_reads/
directory and create a ./readset.tsv
file that holds required info to launch the pipeline.
## Running the pipeline
We can run the pipeline wrapper:
>shotgunMG.py -h
Steps:
1- trim
2- remove_contam
3- assembly
4- gene_prediction
5- abundance
6- exonerate
7- diamond_blastp_kegg
8- rpsblast_cog
9- rpsblast_kog
10- hmmscan_pfam
11- diamond_blastp_nr
12- ncrna
13- taxonomic_annotation
14- beta_diversity
15- alpha_diversity
16- finalize
17- binning
shotgunmg.py -c ./ShotgunMG.ini -s 1-6 -o . -j slurm -z ./pipeline.json > ./commands.sh
Here -s
argument specifies which steps to run. Specifying -s 1-6
will run shotgunMG’s 1 to 6 steps:
1- trim
2- remove_contam
3- assembly
4- gene_prediction
5- abundance
6- exonerate
A selection of certain steps only can also be run:
For instance, -s 1-6
will instruct the pipeline to generate jobs for steps 1 to 6. Once ShotgunMG.py
has been executed, all the jobs generated by the pipeline will be written in standard output or in the commands.sh file specified above. If the argument -j slurm
is specified, all jobs will be written in such a way that they can be submitted to the SLURM job scheduler along with their job dependence requirement(s) and syntax, if any. If the -j batch
argument is specified, the jobs will be written without the SLURM syntax so they can be run interactively on the terminal.
The -z ./pipeline.json
argument specifies the json output file in which to store the pipeline structure, it can then be used for visualization purposes as shown here - http://jtremblay.github.io/PipelineViewer/ShotgunMG.html
The -c ./ShotgunMG.ini
argument refers to the file that contains all the various jobs’ parameters for the pipeline and is described in details below.
shotgunmg.py
and its corresponding libraries : bio/shotgun_metagenomics.py
and bio/microbial_ecology.py
The nrc_pipeline_public repository contains the following structure:
├─ bio
│ ├── [156K] shotgun_metagenomics.py
│ ├── [122K] microbial_ecology.py
│ ├── [ 96K] rrna_amplicons.py # For AmpliconTagger pipeline, not actually used in shotgunmg.py
│ ├── [1.0K] sample.py
│ └── [ 671] readset.py
├── core
│ ├── [5.2K] config.py
│ ├── [7.6K] job.py
│ ├── [ 14K] pipeline.py
│ ├── [ 13K] scheduler.py
│ └── [1.3K] step.py
└── pipelines
├── metagenomics
│ ├── [ 68K] shotgunmg.py
├── [ 16K] common.py
└──llumina
└── [4.4K] illumina.py
It holds the Python libraries on which ShotgunMG depends, but the most relevant Python files to discuss here in the context of this guide are the following:
pipelines/metagenomics/shotgunmg.py
bio/shotgun_metagenomics.py
bio/microbial_ecology.py
shotgunmg.py
contains the main pipeline workflow definition and will make calls to other Python libraries of the repository to create Job objects and manage the dependencies of each job, but the bulk of each job invoked by shotgunmg.py
is defined in shotgunmg.py
and microbial_ecology.py
and each job defined in these two libraries will fetch their parameters from the ShotgunMG.ini
file.
The .ini file.
The .ini file, in our case, ShotgunMG.ini
contains job scheduler, software, database paths and third party packages specific parameters that will be used at each step of the pipeline. The first few lines of the .ini file contain general definitions mainly related to the job scheduler:
[DEFAULT]
cluster_submit_cmd=sbatch
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 1
cluster_pmem=--mem=12000
# change my_account_name with real account name if running on SLURM.
cluster_other_arg=--account=my_account_name --export=ALL --mail-type=END --mail-user=$JOB_MAIL
cluster_queue=
cluster_work_dir_arg=-D
cluster_output_dir_arg=-o
cluster_job_name_arg=-J
cluster_cmd_produces_job_id=true
# the clusterSubmitCmdSuffix must be set to | grep \"[0-9]\" if clusterCmdProducesJobId=true
cluster_submit_cmd_suffix= | grep -o "[0-9]*"
cluster_dependency_arg=-d afterok:
cluster_dependency_sep=:
raw_read_dir=./raw_reads
current_dir=./
job_output_dir=./jobs_output
For the SGE job scheduler, the .ini file would look something like this:
cluster_submit_cmd=jobsub
cluster_walltime=-l h_rt=12:00:0
cluster_cpu=-pe dev 1 -l res_cpus=1
cluster_pmem=-l res_mem=12000
cluster_other_arg=-P nrc_eme -V -j y -l res_image=nrc/nrc_all_default_ubuntu-18.04-amd64_latest
cluster_queue=
cluster_work_dir_arg=-wd
cluster_output_dir_arg=-o
cluster_job_name_arg=-N
cluster_cmd_produces_job_id=true
# the clusterSubmitCmdSuffix must be set to | grep \"[0-9]\" if clusterCmdProducesJobId=true
cluster_submit_cmd_suffix= | grep -o "[0-9]*"
#cluster_dependency_arg=-W depend=afterok:
cluster_dependency_arg=-hold_jid
cluster_dependency_sep=,
extra_java_flags=-XX:ParallelGCThreads=4
raw_read_dir=./raw_reads
current_dir=./
job_output_dir=./jobs_output
# MANDATORY PARAMETERS FOR GPSC/GRIDENGINE
job_scripts_dir=./job_scripts
install_home=/space/project/grdi/eco/bioinfo-tools/nrc
env_modules=/etc/profile.d/modules.sh
tmpdir=/tmp # Not much space in /tmp, consider changing to a path were enough storage is available.
Here the parameters starting with cluster_
defines default values and other values required by SLURM. The raw_reads_dir
specifies where are the raw fastq files located. The job_output_dir
points to the location of all stderr output of each job.
The next chunk of parameters defines which modules will be invoked by the pipeline:
module_perl=nrc/perl/5.26.0
module_python2=nrc/python/2.7.18
module_python3=nrc/python/3.9.0
module_R=nrc/R/3.6.3
module_tools=nrc/nrc_tools/1.3.0
module_rdp_classifier=nrc/rdp_classifier/2.5
module_java=nrc/java/jdk1.8.0_171
module_blast=nrc/blast/2.10.1+
module_trimmomatic=nrc/trimmomatic/0.39
module_bwa=nrc/bwa/0.7.17
module_samtools=nrc/samtools/1.9
module_bedtools=nrc/bedtools/2.23.0
module_exonerate=nrc/exonerate/2.2.0
module_hmmer=nrc/hmmer/3.3.2
module_hdf5=nrc/hdf5/1.8.13
module_pigz=nrc/pigz/2.3.4
module_metabat2=nrc/metabat/2.12.1
module_prodigal=nrc/prodigal/2.6.3
module_checkm=nrc/checkm/1.1.3
module_bowtie2=nrc/bowtie2/2.3.2
module_megahit=nrc/megahit/1.2.9
module_pplacer=nrc/pplacer/1.1
module_diamond=nrc/diamond/2.0.8
module_bbmap=nrc/bbmap/38.11
module_CAT=nrc/CAT/5.2.3
module_maxbin2=nrc/maxbin/2.2.7
module_spades=nrc/SPAdes/3.15.0
module_quast=nrc/quast/5.0.2
module_trnascanse=nrc/trnascan-se/2.0.7
module_barrnap=nrc/barrnap/0.9
module_bedtoolsforbarrnap=nrc/bedtools/2.29.2
module_infernal=nrc/infernal/1.1.3
module_rtk=nrc/rtk/0.93.2
module_fraggenescan=nrc/fraggenescan/1.31
For more information on modules, the reader is invited to consult references on Environment Modules (http://modules.sourceforge.net/).
Next are general (but critical) parameters like sequencing reads configuration, file paths of sequencing adapters, databases, etc. The most consequential parameters are the following: assembler
, mapper
and binner
. ShotgunMG supports the Megahit and Spades assemblers, the bwa-mem and bbmap reads mapper and metabat2 and/or maxbin2 as MAGs/bins generators.
# qc_methods can only be: trimmomatic
qc_methods=trimmomatic
# mapper can be one of these: bwa(i.e. bwa-mem) or bbmap
mapper=bbmap
# assembler can be one of these: megahit or spades.
assembler=megahit
# MAGs generation. Can be one of these: metabat2 or maxbin2 or both with a ":" delimiter.
binner=metabat2:maxbin2
# Annotations. kog is more focused on eukaryotes, so it can be skipped if eukaryotes are not the focus
skip_cog=no
skip_kog=yes
skip_blastp_nr=no
Next are database paths:
[DB]
#if METAGENOME
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
#if METATRANSCRIPTOME
#contaminants=$INSTALL_HOME/databases/RDP_training_sets/greengenes/2014-08-11/gg_13_5_withTaxonomy_corrected_silva_r117_euk_mito_chloro_sortedlength_99pcid_Illumina_contaminants.fasta
And next are all the parameters of each job generated by the pipeline. For instance we have the following values under the [trim]
and [bbduk]
sections:
[trim]
threads=6
trailing_min_quality=30
average_quality=30
min_length=45
# Have a look at the fastqc profiles to make sure you are cropping enough bases at the beginning of the reads.
headcrop=16
# crop is optional. If for some reason you are still stuck with overrepresented kmers at the end, use crop=<in> to force them off.
crop=96
adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa
illumina_clip_settings=:2:10:10
cluster_walltime=-t 6:00:0
cluster_cpu=-N 1 -n 6
cluster_pmem=--mem=32000
.
.
.
Important parameters
The most important parameters for the pipeline are listed below. These parameters should always be doubled-checked before running the pipeline. Key parameters under the [DEFAULT]
and [DB]
categories are discussed above, but relisted here. Special attention should be observed for the [trim]
parameters too as accurately qced reads is paramount to achieve a good assembly and consequently get accurate downstream results. It is strongly encourage to perform a FastQC analysis for at least one sequencing library before and after qc to make sure that reads going into co-assembly are of good quality.
[DB]
#if METAGENOME
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
#if METATRANSCRIPTOME
#contaminants=$INSTALL_HOME/databases/RDP_training_sets/greengenes/2014-08-11/gg_13_5_withTaxonomy_corrected_silva_r117_euk_mito_chloro_sortedlength_99pcid_Illumina_contaminants.fasta
phix=$INSTALL_HOME/databases/contaminants/phix174_NC_001422.1.fasta
## If Bacteria and archea
core=$INSTALL_HOME/databases/fasta/greengenes/core_set_aligned.fasta
rdp_training_set=$INSTALL_HOME/databases/RDP_training_sets/silva/2017-07-12/specie/rRNAClassifier.properties
Jobs and their dependencies
One key feature of ShotgunMG (through the underlying GenPipes engine) is its ability to generate jobs and manage the dependency network of each job. If we look at the first two steps representing three jobs that are being written in the commands.sh file, we see the following instructions:
#-------------------------------------------------------------------------------
# STEP: trim
#-------------------------------------------------------------------------------
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP
#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
/project/6008026/projects/misc/shotgunmg_paper/mock_community/./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz /project/6008026/projects/misc/shotgunmg_paper/mock_community/./raw_reads/Microbial_Mock_Community_Staggered_B_R2
qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Moc
ILLUMINACLIP:/project/6008026/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=rrg-jtrembla --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0 --mem=32000 -N 1 -n 6 | grep -o "[0-9]*")
echo "$trim_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
#-------------------------------------------------------------------------------
# STEP: remove_contam
#-------------------------------------------------------------------------------
STEP=remove_contam
mkdir -p $JOB_OUTPUT_DIR/$STEP
#-------------------------------------------------------------------------------
# JOB: remove_contam_1_JOB_ID: bbduk_paired_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=bbduk_paired_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=$trim_1_JOB_ID
JOB_DONE=job_output/remove_contam/bbduk_paired_Microbial_Mock_Community_Staggered_B.3da67d954e46b705fe615ca301d3b78a.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
remove_contam_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/bbmap/38.11 && \
bbduk.sh \
in=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz \
in2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz \
out=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz \
out2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz \
outm=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R1.fastq \
outm2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R2.fastq \
stats=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.duk_contam_interleaved_log.txt \
k=21 \
minkmerhits=1 \
ref=/project/6008026/databases/contaminants/Illumina.artifacts.fa \
overwrite=true \
threads=1
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=rrg-jtrembla --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 12:00:0 --mem=12000 -N 1 -n 1 -d afterok:$JOB_DEPENDENCIES | grep -o "[0-9]*")
echo "$remove_contam_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
The first step STEP: trim
of the pipeline contains a single job called trimmomatic_Microbial_Mock_Community_Staggered_B
and will run the bbduk.sh
script with proper parameters. This job contains no dependency: the variable JOB_DEPENDENCIES=
is not assigned any values and there are no -d afterok:$JOB_DEPENDENCIES
argument passed on to the sbatch
command. This means that this job will be executed as soon as the job scheduler finds appropriate resources to execute it. In that particular case, we are asking for 12 GB of RAM --mem=32000
, a walltime of 12 hrs -t 6:00:0
and 1 compute node and 1 core or cpu -N 1 -n 6
. These values are specified in the ShotgunMG.ini
file under the [trim]
section.
The second step of the pipeline STEP: remove_contam
contains one job labeled bbduk_paired_Microbial_Mock_Community_Staggered_B
, which depends on the sucessful execution of the trimmomatic_Microbial_Mock_Community_Staggered_B
job. In this case, the JOB_DEPENDENCY
variable is assigned the $trim_1_JOB_ID
value (JOB_DEPENDENCIES=$trim_1_JOB_ID
) and the sbatch command contains the -d afterok:$JOB_DEPENDENCIES
argument. This means that SLURM will wait for the $trim_1_JOB_ID
to be sucessfully completed (i.e. exit_status=0
) before submitting the bbduk_paired_Microbial_Mock_Community_Staggered_B
job in the waiting queue.
How are these jobs generated in the ShotgunMG.py script?
The script pipelines/metagenomics/shotgunmg.py
contains the narrative of the pipeline and will create job objects relying on the bio/shotgun_metagenomics.py
and bio/microbial_ecology.py
Python libraries. For intance the first trimmomatic_Microbial_Mock_Community_Staggered_B
job mentionned above is written on lines 77-109 inside the step trim:
```python
def trim(self):
"""
step trim(): Raw fastqs will be trimmed using Trimmomatic. Interleaved fastqs will be generated after trimming.
"""
mkdir_p(os.path.join(self._root_dir, "qced_reads"))
jobs = []
trim_stats = []
for readset in self.readsets: # The readset file was provided in the shotgunmg.py command as an argument where one line = one readset
# The for loop ill iterate over all readset and generate a trim job for each readset.
trim_file_prefix = os.path.join("qced_reads", readset.sample.name, readset.name + ".trim.")
trim_file_prefix_long = os.path.join("qced_reads", readset.sample.name, readset.name + ".trim.long.")
if not os.path.exists(os.path.join("qced_reads", readset.sample.name)):
os.makedirs(os.path.join("qced_reads", readset.sample.name))
if readset.run_type == "PAIRED_END": # If in the readset, the run type is PAIRED_END (majority of time)...
job = shotgun_metagenomics.trimmomatic( # will call the trimmomatic function inside the bio/trimmomatic.py library
readset.fastq1,
readset.fastq2,
trim_file_prefix + "pair1.fastq.gz",
trim_file_prefix + "single1.fastq.gz",
trim_file_prefix + "pair2.fastq.gz",
trim_file_prefix + "single2.fastq.gz",
readset.quality_offset,
trim_file_prefix + "out",
trim_file_prefix + "stats.csv"
)
job.name = "trimmomatic_" + readset.sample.name
job.subname = "trim"
jobs.append(job)
The above excerpt is just an example with one readset, but in reality, the trim
step will generate one job per readset - so many more jobs. This job object is instanciated by calling the trimmomatic function inside the bio/shotgun_metagenomics.py
Python library which is described below:
def trimmomatic(input1, input2, paired_output1, unpaired_output1, paired_output2, unpaired_output2, quality_offset, trim_log, trim_stats):
= Job(
job # Input files - essential to specify for the job dependency network
[input1, input2], #Output files - essential to specify for the job dependency network. This output file will somehow be used as input for the next job definition.
[paired_output1, unpaired_output1, paired_output2, unpaired_output2, trim_log, trim_stats],
['java', 'module_java'],
['trimmomatic', 'module_trimmomatic']
[# modules are defined in the .ini file. In the current case, we ask for the module_tools=nrc/nrc_tools_public/dev (defined in .ini file) to be loaded
]
)
= config.param('trim', 'threads', type='posint') # config.param calls values defined in the .ini fil, here withing [trim] category, threads=6 in the .ini file.
threads = config.param('trim', 'adapter_fasta', type='filepath') # config.param calls values defined in the .ini file. Here withing the [trim] categoryu, adapter_file=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa (defined in the .ini file)
adapter_file = config.param('trim', 'illumina_clip_settings')
illumina_clip_settings = config.param('trim', 'trailing_min_quality', type='int')
trailing_min_quality = config.param('trim', 'min_length', type='posint')
min_length = config.param('trim', 'headcrop', required=False, type='int')
headcrop = config.param('trim', 'crop', required=False, type='int')
crop = config.param('trim', 'sliding_window1', required=False, type='int')
sliding_window1 = config.param('trim', 'sliding_window2', required=False, type='int')
sliding_window2
if not isinstance( sliding_window1, int ): # Check if sliding_windo1 is defined in the .ini file, if not defined, assigned a default value.
if not(sliding_window1 and sliding_window1.strip()):
= 4
sliding_window1
if not isinstance( sliding_window2, int ):
if not(sliding_window2 and sliding_window2.strip()):
= 15
sliding_window2
# Ppulate the trimmomatic software command itself
= """
job.command java -XX:ParallelGCThreads={threads} -Xmx2G -jar \$TRIMMOMATIC_JAR {mode} \\
-threads {threads} \\
-phred{quality_offset} \\
{input1} {input2} \\
{paired_output1} {unpaired_output1} {paired_output2} {unpaired_output2} \\
ILLUMINACLIP:{adapter_file}{illumina_clip_settings} \\
TRAILING:{trailing_min_quality} \\
SLIDINGWINDOW:{sliding_window1}:{sliding_window2} \\
MINLEN:{min_length} \\
HEADCROP:{headcrop}""".format(
= "PE",
mode = threads,
threads = quality_offset,
quality_offset = input1,
input1 = input2,
input2 = paired_output1,
paired_output1 = paired_output2,
paired_output2 = unpaired_output1,
unpaired_output1 = unpaired_output2,
unpaired_output2 =adapter_file,
adapter_file=illumina_clip_settings,
illumina_clip_settings=trailing_min_quality,
trailing_min_quality= min_length,
min_length = sliding_window1,
sliding_window1 = sliding_window2,
sliding_window2 = str(headcrop)
headcrop
) # If crop is defined in the .ini file, we add it to the job.command string/attribute
if isinstance( crop, int ):
+= """ CROP:{crop}""".format(crop = str(crop))
job.command += " \\\n 2> " + trim_log
job.command # Compute statistics
+= " && \\\ngrep ^Input " + trim_log + " | perl -pe 's/^Input Read Pairs: (\\d+).*Both Surviving: (\\d+).*Forward Only Surviving: (\\d+).*$/Raw Fragments,\\1\\nFragment Surviving,\\2\\nSingle Surviving,\\3/' > " + trim_sta
job.command
# finally return job object.
return job
The 9 arguments passed to the trimmomatic
function are the R1 and R2 fastq.gz files, followed by the 4 output files that trimmomatic will generate. We then specify the quality offset (usually phred 33) one outdir and one filepath that will contain some trimming statistics. These 9 arguments will be used to populate the trimmomatic job arguments. It is critical to specify which are the inputs and outputs of each job as these values are used by the WMS to manage job dependencies when generating jobs for the SLURM or SGE scheduler. In the case of this particular job, once the trimmomatic command sucessfully executes and returns an exit status of zero, 6 files will have been generated : 4 fastq.gz + 1 log + 1 statistic file. Altogeter these last touched files represent the output file defined in the Job object definition and as such is important since it is also used as inputs for the next job bbduk_paired_<readset_name>
, so that the dependency network can be properly built.
Summary
To summarize (Figure 1), the main pipeline worflow is defined in shotgunmg.py
where each Job object will be instanciated and defined in either bio/shotgun_metagenomics.py
or bio/microbial_ecology.py
. Customizable parameters defined in the job definition of these two Python libraries are defined in the .ini file (ShotgunMG.ini
).
2- Workflow walkthrough
Since a complete run of the pipeline can generate hundreds of jobs, we will focus on the jobs of steps 1 to 3 (trim, remove_contam, assembly) and then step 5 (abundance) and finally step 17 (binning).In the next section, we will go through these selected steps of the ShotgunMG pipeline and describe the results that are generated in each of these steps. For this example, the pipeline was run with paired-end fastqs configuration coming from a bacterial mock community.
Description of results generated by each job
For the sake of simplicity, only the commands themselves will be shown. For more details, the reader is referred to the complete commands.sh
file example included in the supplementary material of the ShotgunMG website jtremblay.github.io/shotgunmg.html.
Step 1 - trim
This step implements trimmomatic in the objective of removing sequencing adapters from each read and performing quality control to remove reads of low quality. A particular emphasis should be put in appropriately head and tail cropping the reads. Running FastQC on reads before and after trimming should be done to validate trimming step.
JOB: trimmomatic_<readset_name>
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
/raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz ./raw_reads/Microbial_Mock_Community_Staggered_B_R2
.fastq.gz \
qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single2.fastq.gz \
ILLUMINACLIP:$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.stats.csv
NRC_STATE=\$PIPESTATUS
Purpose: To perform quality control to remove reads of low quality and trimming sequencing adapters.
.ini:
[trim]
threads=6
trailing_min_quality=30
average_quality=30
min_length=45
# Have a look at the fastqc profiles to make sure you are cropping enough bases at the beginning of the reads.
headcrop=16
# crop is optional. If for some reason you are still stuck with overrepresented kmers at the end, use crop=<in> to force them off.
crop=96
adapter_fasta=$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa
illumina_clip_settings=:2:10:10
cluster_walltime=-t 6:00:0
cluster_cpu=-N 1 -n 6
cluster_pmem=--mem=32000
Step 2- remove_contam
This step invoke bbduk (bbtools) to remove known sequencing contaminants and PhiX reads from qced reads obtained from the trim step. The remaining reads ./ShotgunMG/duk/ncontam_nphix.fastq
will be used for the downstream jobs.
JOB: bbduk_paired_
module load nrc/bbmap/38.11 && \
bbduk.sh \
in=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz \
in2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz \
out=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz \
out2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz \
outm=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R1.fastq \
outm2=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.contam_R2.fastq \
stats=qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.duk_contam_interleaved_log.txt \
k=21 \
minkmerhits=1 \
ref=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa \
overwrite=true \
threads=1
Purpose:
To scan fastq libraries for known sequencing contaminants and remove them.
.ini:
[DEFAULT]
contaminants=$INSTALL_HOME/databases/contaminants/Illumina.artifacts.fa
[bbduk]
k=21
c=1
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 1
cluster_pmem=--mem=12000
Step 3- assembly
For the sake of simplicity, the above two steps only showed command examples for one readset. Here for the co-assembly job we show the command with the full readset (6 readsets). Assembly can be done using either spades or megahit. Megahit is much faster than spades and often proved to be the only realist package to use to assemble large sequencing datasets.
JOB: assembly
module load nrc/megahit/1.2.9 && \
rm assembly -rf && \
megahit -t 32 --k-min 31 --k-max 131 --k-step 10 \
--min-contig-len 1000 \
-1 qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_C/Microbial_Mock_Community_Even_C.ncontam_paired_R1.fastq.gz,qced_reads/Microbia
l_Mock_Community_Staggered_A/Microbial_Mock_Community_Staggered_A.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_C/Microbial_Mock_Community_Staggered_C.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.ncontam_paired_R1.fastq.gz,qced_reads/Microbial_Mock_Community_Even_B/Microbial_Mock_Community_Even_B.ncontam_paired_R1.fastq.gz \
-2 qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_C/Microbial_Mock_Community_Even_C.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_A/Microbial_Mock_Community_Staggered_A.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Staggered_C/Microbial_Mock_Community_Staggered_C.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.ncontam_paired_R2.fastq.gz,qced_reads/Microbial_Mock_Community_Even_B/Microbial_Mock_Community_Even_B.ncontam_paired_R2.fastq.gz \
--out-dir assembly && \
sed -i 's/^>\\(\\S\\+\\) .*/>\\1/' assembly/final.contigs.fa && \
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/
Purpose:
To generate assembled contigs from short reads data.
.ini:
[megahit]
# Megahit will use 0.9 pc of the node's memory by default.
min_contig_length=1000
num_threads=32
kmin=31
kmax=131
kstep=10
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 32
cluster_pmem=--mem=514500
Step 4- gene_prediction
Use Prodigal to predict ORFs on contigs
Step 5- abundance
Map qced reads on contigs + generate contigs and gene abundance matrices.
Step 6- exonerate
Split co-assembly and predicted genes fasta files in smaller chunks to perform downstream annotations.
Step 7- diamond_blastp_kegg
DIAMOND BLASTp of predicted genes vs KEGG db.
Step 8- rpsblast_cog
RPSBLAST of predicted genes vs COG db
Step 9- rpsblast_kog
RPSBLAST of predicted genes vs KOG db
Step 10- hmmscan_pfam
HMMSCAN of predicted genes vs PFAM-A DB.
Step 11- diamond_blastp_nr
DIAMOND BLASTp of predicted genes vs NCBI nr
Step 12- ncrna
scan for non-coding RNA in contigs: rRNA and tRNA sequence.
Step 13- taxonomic_annotation
Using DIAMOND blastp results generated in the diamond_blastp vs nr step, each contig will be assigned a taxonomic lineage using the CAT package. Taxonomic summaries and figures are then generated.
Step 14- beta_diversity
Compute beta-diversity on contig abundance and gene abundance matrices
Step 15- alpha_diversity
Compute alpha-diversity (RTK package) on contig abundance and gene abundance matrices
Step 16- finalize
Merge all annotations data generated in steps 7-11 and step 13 in a single gff and tsv file
Step 17- binning
Generates MAGs/bins with the metabat2 and/or maxbin2 software.
3- Smart restart mechanism
One very useful feature of ShotgunMG (and GenPipes-based pipelines) is its built-in smart-restart mechanism. For instance, let’s generate jobs for the first step of the pipeline by running the following command : ShotgunMG.py -c ./ShotgunMG.ini -s 1 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh
and let’s have look at the job definition:
#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Staggered_B
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Staggered_B
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_144 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
./raw_reads/Microbial_Mock_Community_Staggered_B_R1.fastq.gz ./raw_reads/Microbial_Mock_Community_Staggered_B_R2.fastq.gz \
qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single1.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.pair2.fastq.gz qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.single2.fastq.gz \
ILLUMINACLIP:$INSTALL_HOME/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward Only Surviving: (\d+).*$/Raw Fragments,\1\nFragment Surviving,\2\nSingle Surviving,\3/' > qced_reads/Microbial_Mock_Community_Staggered_B/Microbial_Mock_Community_Staggered_B.trim.stats.csv
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
sbatch --account=my-account --export=ALL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0 --mem=32000 -N 1 -n 6 | grep -o "[0-9]*")
echo "$trim_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
The value stored in the JOB_DONE variable JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Staggered_B.f72f4f889cb8f8cf9fcbfac1758b80b5.nrc.done
is actually an md5 signature of all the files creation date, names and sizes combined with all the job’s parameters used to generate the job. Once properly executed on the compute cluster and an exit_status=0
is returned, the file defined by the JOB_DONE variable will be touched (i.e. created) (if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
).
Once this job has successfully completed, if we decide to run the first step of the pipeline again (ShotgunMG.py -c ./ShotgunMG.ini -s 1 -o . -j slurm -r readset.tsv -z ./pipeline.json > ./commands.sh
), we’ll find that no job was written in the standard output commands.sh
file. This is because during the job generation process, the pipeline detected that this job was up to date, because:
1) infiles (fastq.gz files) of the job were older than the outfiles (.pair1.fastq.gz, .pair2.fastq.gz, .single1.fastq.gz, single2.fastq.gz
) it generated,
2) no changes whatsoever were done on the input files and in consequence,
3) the file size of the input files remained the same.
4) There were no modifications in the parameters in the .ini file under the [trim] section. Overall, the md5 signature is identical between the executed job and the new Job object and in consequence, no new job is created. If we would have modified the header of one of the fastq headers in one of the fastq file and had rerun the pipeline, this job would have been re-generated, because the pipeline would have found differences in the old and new md5 signed .done
file. This mechanism is useful if for some reason we realized that one of the parameter given to trimmomatic was not optimal. For instance, if we would increase the crop value from 96 to 100, upon re-execution of shotgunmg.py
, all the trimmomatic jobs would be regenerated because the md5 signature of the old .done
file would not match the new md5 signature of the new job anymore - in other words, the job would not be up-to-date anymore and would in consequence be re-generated. This mechanism is also useful in the context of random job failure because of a faulty compute node or (as described above) if the pipeline operator is interested in modifying one or many variables of a given step of the pipeline.
Another instance where this mechanism is useful is when a job crashes because we didn’t asked for enough memory or cores for its execution and SLURM detects this and kills off our job because the limits of the resources we asked for were not respected. When that happens, an exit_status other than 0 is returned for that offending job and the rest of the downstream dependent jobs are then killed by SLURM. Instead of manually going through each single job that went through to find which one returned a non-zero exit_status, the pipeline operator can simply rerun ShotgunMG.py
with the same parameters and see which jobs are generated.
4- End results
The arguably most important results generated by the pipeline are the following files:
Read counts
Contains read count summary throughout the different steps of the pipeline.
contig_abundance/qc_mappping_stats.tsv
Co-assembly
Co-assembly fasta file:
assembly/Contigs.fasta
Statistics and profiles of co-assembly.
assembly/assembly_stats.txt
assembly/quast/report.html
Gene prediction
Predicted genes in fna, faa and gff formats:
gene_prediction/Contigs_renamed.fna
gene_prediction/Contigs_renamed.faa
gene_prediction/Contigs_renamed.gff
Contigs and genes abundance matrices
Text files in .tsv format. Un normalized (absolute number of qced reads that properly aligned to genes or contigs):
gene_abundance/merged_gene_abundance.tsv
contig_abundance/merged_contig_abundance.tsv
Normalized abundance - using edgeR package with TMM method - gives Count Per Million (CPM):
gene_abundance/merged_gene_abundance_cpm.tsv
contig_abundance/merged_contig_abundance_cpm.tsv
Annotations
All merged annotations (except ncrna) combined in a .tsv file
annotations/annotations.tsv
Alpha diversity
Contains the alpha diversity values. Observed contigs/genes (richness), shannon, etc.
alpha_div/contig_abundance/alpha_richness.tsv
alpha_div/contig_abundance/alpha_shannon.tsv
alpha_div/gene_abundance/alpha_richness.tsv
alpha_div/gene_abundance/alpha_shannon.tsv
Beta diversity
Contains the beta diversity (Bray-Curtis dissimilarity) distance matrices, PCoAs and interactive figures computed from contig and gene abundance matrices. Interactive 3d figures that can be opened in most internet browsers. Beta diversity metrics are generated using microbiomeutils (https://github.com/jtremblay/microbiomeutils)
betadiv/bray_curtis_contig_bacteriaArchaea/3d_bray_curtis_plot/index.html
betadiv/bray_curtis_contig_abundance/3d_bray_curtis_plot/index.html
betadiv/bray_curtis_gene_abundance/3d_bray_curtis_plot/index.html
Distance matrices.
betadiv/bray_curtis_contig_abundance/bray_curtis_contig_abundance.tsv
betadiv/bray_curtis_contig_bacteriaArchaea/bray_curtis_contig_bacteriaArchaea.tsv
betadiv/bray_curtis_gene_abundance/bray_curtis_gene_abundance.tsv
Principal coordinates analysis tables computed from the distance matrices described above.
betadiv/bray_curtis_contig_abundance/bray_curtis_contig_abundance_coords.tsv
betadiv/bray_curtis_contig_bacteriaArchaea/bray_curtis_contig_bacteriaArchaea_coords.tsv
betadiv/bray_curtis_gene_abundance/bray_curtis_gene_abundance_coords.tsv
DIAMOND BLASTp output of genes vs KEGG genes DB
annotations/blastp_kegg.tsv
annotations/blastp_kegg_parsed.tsv
DIAMOND BLASTp output of genes vs NCBI nr DB
annotations/blastp_nr.tsv
hmmscan output of genes vs PFAM-A DB.
Three different types of output are provided. Please consult HMMer 3 documentation for more details.
annotations/hmmscan_pfam_domtblout.tsv
annotations/hmmscan_pfam_pfamtblout.tsv
annotations/hmmscan_pfam_tblout.tsv
tRNA (tRNAscan) and rRNA (barrnap) abundance/annotations
# tRNA
annotations/trna/trna*tsv
# rRNA
annotations/rrna/rdp/*
annotations/rrna/barrnap/*
annotations/rrna/abundance/*
annotations/rrna/abundance/arc_SSU_merged_abundance.tsv
annotations/rrna/abundance/bac_LSU_merged_abundance.tsv
annotations/rrna/abundance/bac_SSU_merged_abundance.tsv
annotations/rrna/abundance/euk_LSU_merged_abundance.tsv
annotations/rrna/abundance/euk_SSU_merged_abundance.tsv
Taxonomic profiles
Taxonomic profile tables at L1=kindgom, L2=phylum, L3=class, L4=order, L5=family and L6=genus. Both in absolute (consensus rarefaction) or relative. abundance.
# All organisms
annotations/taxonomy_consensus/all/absolute/
annotations/taxonomy_consensus/all/relative/
# Bacteria/Archaea only
annotations/taxonomy_consensus/bacteriaArchaea/absolute/
annotations/taxonomy_consensus/bacteriaArchaea/relative/
# Non-bacterial/archaeal:
annotations/taxonomy_consensus/others/absolute/
annotations/taxonomy_consensus/others/relative/
5- Job output folder
When a job is launched/submitted whether on the job scheduler or directly on the terminal with the batch
option, the standard output and error are redirected and written in the ./job_output/
folder in the step subfolder that this job belongs to. For instance, consider the following job, where we have a JOB_OUTPUT_DIR
set to ./job_output
. The trim step (step 1) will generate a subdirectory :
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP
For instance, if we look at the trimmomatic_Microbial_Mock_Community_Even_A
job, we have the following instructions which will define the file that will hold the standard output and error:
JOB_NAME=trimmomatic_Microbial_Mock_Community_Even_A
JOB_DEPENDENCIES=
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
Finally, when the job is submitted to the job scheduler with the following command: sbatch --account=my_account_xyz --export=ALL --mail-type=END --mail-user=$JOB_MAIL -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -t 6:00:0 --mem=32000 -N 1 -n 6
, the -o $JOB_OUTPUT
instructs the job scheduler were to write the standard output and error.
Here is the complete chunk of instructions:
OUTPUT_DIR=/fs/hnas1-evs1/Dgrdi/grdi_eco/training/shotgun_metagenomics_20201125/mock_community_shotgunMG/megahit_bwa
JOB_OUTPUT_DIR=$OUTPUT_DIR/job_output
TIMESTAMP=`date +%FT%H.%M.%S`
JOB_LIST=$JOB_OUTPUT_DIR/Metagenomics_job_list_$TIMESTAMP
mkdir -p $OUTPUT_DIR
cd $OUTPUT_DIR
#-------------------------------------------------------------------------------
# STEP: trim
#-------------------------------------------------------------------------------
STEP=trim
mkdir -p $JOB_OUTPUT_DIR/$STEP
#-------------------------------------------------------------------------------
# JOB: trim_1_JOB_ID: trimmomatic_Microbial_Mock_Community_Even_A
#-------------------------------------------------------------------------------
JOB_NAME=trimmomatic_Microbial_Mock_Community_Even_A
JOB_DEPENDENCIES=
JOB_DONE=job_output/trim/trimmomatic_Microbial_Mock_Community_Even_A.e8414ede7cfd9241c50df6291e44ab89.nrc.done
JOB_OUTPUT_RELATIVE_PATH=$STEP/${JOB_NAME}_$TIMESTAMP.o
JOB_OUTPUT=$JOB_OUTPUT_DIR/$JOB_OUTPUT_RELATIVE_PATH
trim_1_JOB_ID=$(echo "#!/bin/sh
rm -f $JOB_DONE && \
set -o pipefail
module load nrc/java/jdk1.8.0_171 nrc/trimmomatic/0.39 && \
java -XX:ParallelGCThreads=6 -Xmx2G -jar \$TRIMMOMATIC_JAR PE \
-threads 6 \
-phred33 \
/fs/hnas1-evs1/Dgrdi/grdi_eco/training/shotgun_metagenomics_20201125/mock_community_shotgunMG/megahit_bwa/./raw_reads//Microbial_Mock_Community_Even_A_R1.fastq.gz /fs/
qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.pair1.fastq.gz qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even
ILLUMINACLIP:/space/project/grdi/eco/bioinfo-tools/nrc/databases/contaminants/adapters-nextera-xt.fa:2:10:10 \
TRAILING:30 \
SLIDINGWINDOW:4:15 \
MINLEN:45 \
HEADCROP:16 CROP:96 \
2> qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.out && \
grep ^Input qced_reads/Microbial_Mock_Community_Even_A/Microbial_Mock_Community_Even_A.trim.out | perl -pe 's/^Input Read Pairs: (\d+).*Both Surviving: (\d+).*Forward On
NRC_STATE=\$PIPESTATUS
echo NRC_exitStatus:\$NRC_STATE
if [ \$NRC_STATE -eq 0 ] ; then touch $JOB_DONE ; fi
exit \$NRC_STATE" | \
jobsub -P nrc_eme -V -j y -l res_image=nrc/nrc_all_default_ubuntu-18.04-amd64_latest -wd $OUTPUT_DIR -o $JOB_OUTPUT -N $JOB_NAME -l h_rt=6:00:0 -l res_mem=32000 -pe dev
echo "$trim_1_JOB_ID $JOB_NAME $JOB_DEPENDENCIES $JOB_OUTPUT_RELATIVE_PATH" >> $JOB_LIST
6- Integrating more options in a given bioinformatics package
One useful feature of ShotgunMG (and the GenPipes WMS) is the possibility to customize each step of a workflow. For instance, if we would want to add an option for Megahit, we would need to add a few lines of code to the megahit
function found in the bio/shotgun_metagenomics.py
file. As a reminder, the megahit_R1_R2
function is called from the main pipeline script/wrapper shotgunmg.py
:
= shotgun_metagenomics.megahit_R1_R2(
job
fastq_gz_list_R1,
fastq_gz_list_R2,"assembly"),
os.path.join(
reads_type
)= "megahit"
job.name = "megahit"
job.subname jobs.append(job)
where a job object will be instantiated in the megahit_R1_R2
function of the bio/shotgun_metagenomics.py
library:
def megahit_R1_R2(infiles_R1, infiles_R2, outdir, type):
if(type == "pe"):
= "-1 " + ",".join(infiles_R1);
infiles_string_R1 = "-2 " + ",".join(infiles_R2);
infiles_string_R2
= Job(
job + infiles_R2,
infiles_R1 "Contigs.fasta"), os.path.join(outdir, "final.contigs.fa")],
[os.path.join(outdir,
['megahit', 'module_megahit']
[
]
) ="""
job.commandrm {outdir} -rf && \\
megahit -t {num_threads} --k-min {kmin} --k-max {kmax} --k-step {kstep} \\
--min-contig-len {min_contig_length} \\
{infiles_R1} \\
{infiles_R2} \\
--out-dir {outdir} && \\
sed -i 's/^>\\\(\\\S\\\+\\\) .*/>\\\\1/' final.contigs.fa && \\
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/""".format(
= config.param("megahit", "min_contig_length", 1, "posint"),
min_contig_length = config.param("megahit", "num_threads", 1, "posint"),
num_threads = config.param("megahit", "kmin", 1, "posint"),
kmin = config.param("megahit", "kmax", 1, "posint"),
kmax = config.param("megahit", "kstep", 1, "int"),
kstep = infiles_string_R1,
infiles_R1 = infiles_string_R2,
infiles_R2 = outdir
outdir
)
return job
In the current implementation, we only included 5 parameters specific to Megahit (min_contig_length, num_threads, kmin, kmax and kstep
. If for instance at some point we realize that it would be useful to implement support for Megahit’s --prune-depth <int>
option, we would need to add a line specifying this option in the megahit_R1_R2
function:
def megahit_R1_R2(infiles_R1, infiles_R2, outdir, type):
if(type == "pe"):
= "-1 " + ",".join(infiles_R1);
infiles_string_R1 = "-2 " + ",".join(infiles_R2);
infiles_string_R2
= Job(
job + infiles_R2,
infiles_R1 "Contigs.fasta"), os.path.join(outdir, "final.contigs.fa")],
[os.path.join(outdir,
['megahit', 'module_megahit']
[
]
) ="""
job.commandrm {outdir} -rf && \\
megahit -t {num_threads} --k-min {kmin} --k-max {kmax} --k-step {kstep} \\
--min-contig-len {min_contig_length} \\
{infiles_R1} \\
{infiles_R2} \\
--out-dir {outdir} && \\
sed -i 's/^>\\\(\\\S\\\+\\\) .*/>\\\\1/' assembly/final.contigs.fa && \\
ln -s -f final.contigs.fa Contigs.fasta && mv Contigs.fasta ./assembly/""".format(
= config.param("megahit", "min_contig_length", 1, "posint"),
min_contig_length = config.param("megahit", "num_threads", 1, "posint"),
num_threads = config.param("megahit", "kmin", 1, "posint"),
kmin = config.param("megahit", "kmax", 1, "posint"),
kmax = config.param("megahit", "kstep", 1, "int"),
kstep = config.param("megahit", "prune_level", 1, "int"), #!! <--- Here we added a new option
prune_level = infiles_string_R1,
infiles_R1 = infiles_string_R2,
infiles_R2 = outdir
outdir
)
return job
We would also need to add the option prune_level
in the .ini
file under the [megahit]
category.
[megahit]
# Megahit will use 0.9 pc of the node's memory by default.
min_contig_length=1000
num_threads=32
kmin=31
kmax=131
kstep=10
prune_level=3
cluster_walltime=-t 12:00:0
cluster_cpu=-N 1 -n 32
cluster_pmem=--mem=514500
We should be ready to go with this new option implementation. Adding an option for Megahit is a simple case scenario as it consists in the simple implementation of an additional argument to the megahit command.