For this tutorial, we’ll process a 10X genomics single cell sequencing data from the BioProject PRJNA949613 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA949613) using the R Seurat package. Note that if you intend to use all the sequencing data from this project, you will need a machine with a considerable amount of RAM. As scRNA data types are getting increasingly larger with the latest flavor of Illumina sequencing technology, it is unclear if the Seurat package in its current form will be adequate to digest these modern and larger datasets. In the current example/tutorial, we’ll downsize the size of each single cell abundance matrix to 1,500 cells. The full dataset has approx. 20,000 cells per sample.
module load nrc/sratoolkit/3.0.2
fastq-dump --split-files --outdir ./raw_reads/ SRR23995222
fastq-dump --split-files --outdir ./raw_reads/ SRR23995223
fastq-dump --split-files --outdir ./raw_reads/ SRR23995224
fastq-dump --split-files --outdir ./raw_reads/ SRR23995225
The whole project contains more fastqs, but only these 4 libraries are needed for this tutorial.
SRR23995226, SRR23995227, SRR23995228, SRR23995229, SRR23995230, SRR23995231,
SRR23995232, SRR23995233, SRR23995234, SRR23995235, SRR23995236, SRR23995237
These latter libraries contain 3 fastqs per accession number because they each contain multiple sample per accession #. Demultiplexing would therefore be necessary before proceeding.
For our 4 libraries, here is the associated metadata:
#SampleID Env CellType Treatment
SRR23995222 tumor_draining_lymph_nodes Cd69 Cd69_TDLN
SRR23995223 tumor_draining_lymph_nodes Cd69 Cd69_TDLN
SRR23995224 tumor_draining_lymph_nodes WT WT_TDLN
SRR23995225 tumor_draining_lymph_nodes WT WT_TDLN
This is not mandatory, but I renamed each fastq according to the naming scheme specified by Illumina.
ls -l raw_reads/
total 342575426
-rw-rw-r-- 1 jtrembla jtrembla 49029502344 Apr 13 14:14 Cd69-TDLN-rep1_S1_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 99075609546 Apr 13 14:20 Cd69-TDLN-rep1_S1_L001_R2_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 44795129760 Apr 13 14:22 Cd69-TDLN-rep2_S2_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 90538568046 Apr 13 14:28 Cd69-TDLN-rep2_S2_L001_R2_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 47692915892 Apr 13 14:31 WT-TDLN-rep1_S1_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 96380878796 Apr 13 14:36 WT-TDLN-rep1_S1_L001_R2_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 47262116084 Apr 13 14:39 WT-TDLN-rep2_S2_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 95512330796 Apr 13 14:45 WT-TDLN-rep2_S2_L001_R2_001.fastq
Once the fastqs are downloaded and renamed, the SingleCell pipeline (Nextflow) can be executed. Just make sure to double check the parameters in the singlecell.config file (lines 26 to 30). You also need to have the STAR v2.7.10b software installed and available through environment modules. Java and Nextflow also need to be properly installed and functional.
26 raw_reads = "$projectDir/raw_reads/*_R{1,2}_001.fastq"
27 outdir = "$projectDir/output/"
28 mapping_file = "$projectDir/mapping_file.tsv"
29 ref_genome = "$INSTALL_HOME/databases/single_cell/refdata-gex-mm10-2020-A_alt"
30 whitelist = "$INSTALL_HOME/databases/single_cell/3M-february-2018.txt"
The mouse genome also needs to be build with STAR. For instance:
module load nrc/STAR
STAR \
--runMode genomeGenerate \
--runThreadN 8 \
--genomeDir ./ \
--genomeFastaFiles ../refdata-gex-mm10-2020-A/fasta/genome.fa \
--sjdbGTFfile ../refdata-gex-mm10-2020-A/genes/genes.gtf
echo "done."
Once everything is properly setup, the SingleCell pipeline can be run:
module load nrc/nextflow/22.10.7.5853 java/17.0.2
nextflow run -c ./singlecell.config ./singlecell.nf -resume
STAR is much faster than cellranger and completes fairly quickly. In this case, each library was processed with 16 threads under a total of 64 GB RAM and less than 6 hrs.
Once completed, you should have the following in your output directory. The latest version of my SingleCell pipeline also includes .h5 files (1 per sample)
ls output/star/
Cd69-TDLN-rep1_S1_L001_barcodes.tsv Cd69-TDLN-rep1_S1_L001_Summary.csv Cd69-TDLN-rep2_S2_L001_matrix.mtx WT-TDLN-rep1_S1_L001_features.tsv WT-TDLN-rep2_S2_L001_Features.stats
Cd69-TDLN-rep1_S1_L001_Features.stats Cd69-TDLN-rep2_S2_L001_barcodes.tsv Cd69-TDLN-rep2_S2_L001_Summary.csv WT-TDLN-rep1_S1_L001_matrix.mtx WT-TDLN-rep2_S2_L001_features.tsv
Cd69-TDLN-rep1_S1_L001_features.tsv Cd69-TDLN-rep2_S2_L001_Features.stats WT-TDLN-rep1_S1_L001_barcodes.tsv WT-TDLN-rep1_S1_L001_Summary.csv WT-TDLN-rep2_S2_L001_matrix.mtx
Cd69-TDLN-rep1_S1_L001_matrix.mtx Cd69-TDLN-rep2_S2_L001_features.tsv WT-TDLN-rep1_S1_L001_Features.stats WT-TDLN-rep2_S2_L001_barcodes.tsv WT-TDLN-rep2_S2_L001_Summary.csv
This is pretty much all the bioinformatics heavy lifting that has to be done.
You can then download the data locally or use a visual node on your local compute cluster provider.
library(Seurat)
library(Matrix)
library(DoubletFinder)
library(cowplot)
library(dplyr)
library(reshape2)
library(ggplot2)
library(harmony)
library(reticulate)
library(rafalib)
library(clustree)
library(edgeR)
library(MAST)
library(ggpubr)
library(enrichR)
library(msigdbr)
library(fgsea)
library(DropletUtils)
library(org.Mm.eg.db)
library(data.table)
setwd("~/Projects/scrna_ex/")
knitr::opts_knit$set(root.dir = '~/Projects/scrna_ex/')
# load annotations. We'll change ENSEMBL IDs with gene name ids.
# geneInfo.tab comes from the same folder used for STAR genome indexing
gene_info = data.frame(fread("./data/geneInfo.tab", sep="\t", skip=1, header=F))
mtx_files = list.files("./output/star/", pattern="*.mtx")
barcodes_files = list.files("./output/star/", pattern="*_barcodes.tsv")
features_files = list.files("./output/star/", pattern="*_features.tsv")
for(i in 1:length(mtx_files)){
# define file paths
mtx_file = mtx_files[i]
base = basename(mtx_file)
base = gsub("_matrix.mtx", "", base)
barcodes_file = file.path("./output/star/", paste0(base, "_barcodes.tsv"))
features_file = file.path("./output/star/", paste0(base, "_features.tsv"))
mtx_file = paste0("./output/star/", mtx_file)
# read files
df = readMM(mtx_file)
barcodes = read.table(barcodes_file, header=F)$V1
features = read.table(features_file, header=F)$V1
# Here we assume that gene_info is the exact same order as features
if(identical(features, gene_info$V1) == TRUE){
features = gene_info$V2
}else{
break("features and gene_info are not identical.")
}
row.names(df) = features
colnames(df) = barcodes
# Here for the sake of this tutorial, we'll downsize the matrices a little bit
# We'll keep only the first 2000 cells. Of course in real life, we would not
# do this. However, we'd need a machine with more RAM to perform the downstream
# analyses.
df = df[,1:1500]
# Here let's write the files in .h5 format for better portability.
write10xCounts(
paste0("./output/star/", base, ".h5"),
df,
#barcodes = colnames(x),
#gene.id = rownames(x),
#gene.symbol = gene.id,
gene.type = "Gene Expression",
overwrite = TRUE,
type = "HDF5",
genome = "mm10",
version = c("3"),
chemistry = "Single Cell 3' v3",
original.gem.groups = 1L,
library.ids = "custom"
)
df = NULL
df1 = NULL
gc()
}
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
# Load h5 files
cd69_rep1 <- Seurat::Read10X_h5(filename = "./output/star/Cd69-TDLN-rep1_S1_L001.h5", use.names = T)
cd69_rep2 <- Seurat::Read10X_h5(filename = "./output/star/Cd69-TDLN-rep2_S2_L001.h5", use.names = T)
wt_rep1 <- Seurat::Read10X_h5(filename = "./output/star/WT-TDLN-rep1_S1_L001.h5", use.names = T)
wt_rep2 <- Seurat::Read10X_h5(filename = "./output/star/WT-TDLN-rep2_S2_L001.h5", use.names = T)
# create Seurat objects
sdata.cd69_rep1 <- CreateSeuratObject(cd69_rep1, project = "cd69_1")
sdata.cd69_rep2 <- CreateSeuratObject(cd69_rep2, project = "cd69_2")
sdata.wt_rep1 <- CreateSeuratObject(wt_rep1, project = "wt_1")
sdata.wt_rep2 <- CreateSeuratObject(wt_rep2, project = "wt_2")
# create metadata
sdata.cd69_rep1$type = "cd69"
sdata.cd69_rep2$type = "cd69"
sdata.wt_rep1$type = "wt"
sdata.wt_rep2$type = "wt"
# Then merge as one single object:
alldata <- merge(sdata.cd69_rep1,
c(
sdata.cd69_rep2,
sdata.wt_rep1,
sdata.wt_rep2
),
add.cell.ids = c("cd69_rep1", "cd69_rep2", "wt_rep1", "wt_rep2")
)
# Remove unused objects and run garbage collector:
rm(sdata.cd69_rep1, sdata.cd69_rep2, sdata.wt_rep1, sdata.wt_rep2)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 9500994 507.5 16402761 876.1 11755314 627.9
## Vcells 88253680 673.4 139586585 1065.0 132911636 1014.1
#Lets's have a look at our newly created Seurat object.
as.data.frame(alldata@assays$RNA@counts[1:10, 1:2])
## cd69_rep1_AAACCCAAGTCCCGAC cd69_rep1_AAACCCACAACCCGCA
## 4933401J01Rik 0 0
## Gm26206 0 0
## Xkr4 0 0
## Gm18956 0 0
## Gm37180 0 0
## Gm37363 0 0
## Gm37686 0 0
## Gm1992 0 0
## Gm37329 0 0
## Gm7341 0 0
head(alldata@meta.data, 10)
## orig.ident nCount_RNA nFeature_RNA type
## cd69_rep1_AAACCCAAGTCCCGAC cd69_1 6082 2415 cd69
## cd69_rep1_AAACCCACAACCCGCA cd69_1 8883 2963 cd69
## cd69_rep1_AAACCCACAGAGATGC cd69_1 9820 2784 cd69
## cd69_rep1_AAACCCACATCTATCT cd69_1 5658 2287 cd69
## cd69_rep1_AAACCCAGTGGAACAC cd69_1 4111 1734 cd69
## cd69_rep1_AAACCCAGTTAGAGAT cd69_1 4399 1790 cd69
## cd69_rep1_AAACCCATCAGAACCT cd69_1 8236 2491 cd69
## cd69_rep1_AAACCCATCATTTGCT cd69_1 18727 4598 cd69
## cd69_rep1_AAACCCATCCAAGCAT cd69_1 3293 1488 cd69
## cd69_rep1_AAACCCATCGACGAGA cd69_1 4593 1761 cd69
str(alldata)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:15828397] 32 44 68 70 123 125 160 169 202 318 ...
## .. .. .. .. .. ..@ p : int [1:6001] 0 2415 5378 8162 10449 12183 13973 16464 21062 22550 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 48440 6000
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:48440] "4933401J01Rik" "Gm26206" "Xkr4" "Gm18956" ...
## .. .. .. .. .. .. ..$ : chr [1:6000] "cd69_rep1_AAACCCAAGTCCCGAC" "cd69_rep1_AAACCCACAACCCGCA" "cd69_rep1_AAACCCACAGAGATGC" "cd69_rep1_AAACCCACATCTATCT" ...
## .. .. .. .. .. ..@ x : num [1:15828397] 2 2 1 1 1 1 1 1 7 2 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:15828397] 32 44 68 70 123 125 160 169 202 318 ...
## .. .. .. .. .. ..@ p : int [1:6001] 0 2415 5378 8162 10449 12183 13973 16464 21062 22550 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 48440 6000
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:48440] "4933401J01Rik" "Gm26206" "Xkr4" "Gm18956" ...
## .. .. .. .. .. .. ..$ : chr [1:6000] "cd69_rep1_AAACCCAAGTCCCGAC" "cd69_rep1_AAACCCACAACCCGCA" "cd69_rep1_AAACCCACAGAGATGC" "cd69_rep1_AAACCCACATCTATCT" ...
## .. .. .. .. .. ..@ x : num [1:15828397] 2 2 1 1 1 1 1 1 7 2 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ scale.data : num[0 , 0 ]
## .. .. .. ..@ key : chr "rna_"
## .. .. .. ..@ assay.orig : NULL
## .. .. .. ..@ var.features : logi(0)
## .. .. .. ..@ meta.features:'data.frame': 48440 obs. of 0 variables
## .. .. .. ..@ misc : list()
## ..@ meta.data :'data.frame': 6000 obs. of 4 variables:
## .. ..$ orig.ident : chr [1:6000] "cd69_1" "cd69_1" "cd69_1" "cd69_1" ...
## .. ..$ nCount_RNA : num [1:6000] 6082 8883 9820 5658 4111 ...
## .. ..$ nFeature_RNA: int [1:6000] 2415 2963 2784 2287 1734 1790 2491 4598 1488 1761 ...
## .. ..$ type : chr [1:6000] "cd69" "cd69" "cd69" "cd69" ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 4 levels "cd69_1","cd69_2",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:6000] "cd69_rep1_AAACCCAAGTCCCGAC" "cd69_rep1_AAACCCACAACCCGCA" "cd69_rep1_AAACCCACAGAGATGC" "cd69_rep1_AAACCCACATCTATCT" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 4 1 3
## ..@ commands : list()
## ..@ tools : list()
# Save data
dir.create("results", showWarnings = F)
saveRDS(alldata, "./results/seurat_part1.rds")