Background

For this tutorial, we’ll process a 10X genomics single cell sequencing data from the BioProject PRJNA949613 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA949613) using the R Seurat package. Note that if you intend to use all the sequencing data from this project, you will need a machine with a considerable amount of RAM. As scRNA data types are getting increasingly larger with the latest flavor of Illumina sequencing technology, it is unclear if the Seurat package in its current form will be adequate to digest these modern and larger datasets. In the current example/tutorial, we’ll downsize the size of each single cell abundance matrix to 1,500 cells. The full dataset has approx. 20,000 cells per sample.

Part 0 - Setting up the data and running the bioinformatics workflow.

Download the fastqs

module load nrc/sratoolkit/3.0.2
fastq-dump --split-files  --outdir ./raw_reads/ SRR23995222
fastq-dump --split-files  --outdir ./raw_reads/ SRR23995223
fastq-dump --split-files  --outdir ./raw_reads/ SRR23995224
fastq-dump --split-files  --outdir ./raw_reads/ SRR23995225

The whole project contains more fastqs, but only these 4 libraries are needed for this tutorial.

SRR23995226, SRR23995227, SRR23995228, SRR23995229, SRR23995230, SRR23995231, 
SRR23995232, SRR23995233, SRR23995234, SRR23995235, SRR23995236, SRR23995237

These latter libraries contain 3 fastqs per accession number because they each contain multiple sample per accession #. Demultiplexing would therefore be necessary before proceeding.

For our 4 libraries, here is the associated metadata:

#SampleID    Env                         CellType  Treatment
SRR23995222  tumor_draining_lymph_nodes  Cd69      Cd69_TDLN
SRR23995223  tumor_draining_lymph_nodes  Cd69      Cd69_TDLN
SRR23995224  tumor_draining_lymph_nodes  WT        WT_TDLN
SRR23995225  tumor_draining_lymph_nodes  WT        WT_TDLN

This is not mandatory, but I renamed each fastq according to the naming scheme specified by Illumina.

ls -l raw_reads/
total 342575426
-rw-rw-r-- 1 jtrembla jtrembla 49029502344 Apr 13 14:14 Cd69-TDLN-rep1_S1_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 99075609546 Apr 13 14:20 Cd69-TDLN-rep1_S1_L001_R2_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 44795129760 Apr 13 14:22 Cd69-TDLN-rep2_S2_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 90538568046 Apr 13 14:28 Cd69-TDLN-rep2_S2_L001_R2_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 47692915892 Apr 13 14:31 WT-TDLN-rep1_S1_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 96380878796 Apr 13 14:36 WT-TDLN-rep1_S1_L001_R2_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 47262116084 Apr 13 14:39 WT-TDLN-rep2_S2_L001_R1_001.fastq
-rw-rw-r-- 1 jtrembla jtrembla 95512330796 Apr 13 14:45 WT-TDLN-rep2_S2_L001_R2_001.fastq

Once the fastqs are downloaded and renamed, the SingleCell pipeline (Nextflow) can be executed. Just make sure to double check the parameters in the singlecell.config file (lines 26 to 30). You also need to have the STAR v2.7.10b software installed and available through environment modules. Java and Nextflow also need to be properly installed and functional.

26         raw_reads = "$projectDir/raw_reads/*_R{1,2}_001.fastq"
27         outdir = "$projectDir/output/"
28         mapping_file = "$projectDir/mapping_file.tsv"
29         ref_genome = "$INSTALL_HOME/databases/single_cell/refdata-gex-mm10-2020-A_alt"
30         whitelist = "$INSTALL_HOME/databases/single_cell/3M-february-2018.txt"

The mouse genome also needs to be build with STAR. For instance:

module load nrc/STAR
STAR \
    --runMode genomeGenerate \
    --runThreadN 8 \
    --genomeDir ./ \
    --genomeFastaFiles ../refdata-gex-mm10-2020-A/fasta/genome.fa \
    --sjdbGTFfile ../refdata-gex-mm10-2020-A/genes/genes.gtf
echo "done."

Once everything is properly setup, the SingleCell pipeline can be run:

module load nrc/nextflow/22.10.7.5853 java/17.0.2
nextflow run -c ./singlecell.config ./singlecell.nf -resume

STAR is much faster than cellranger and completes fairly quickly. In this case, each library was processed with 16 threads under a total of 64 GB RAM and less than 6 hrs.

Once completed, you should have the following in your output directory. The latest version of my SingleCell pipeline also includes .h5 files (1 per sample)

ls output/star/
Cd69-TDLN-rep1_S1_L001_barcodes.tsv    Cd69-TDLN-rep1_S1_L001_Summary.csv     Cd69-TDLN-rep2_S2_L001_matrix.mtx    WT-TDLN-rep1_S1_L001_features.tsv  WT-TDLN-rep2_S2_L001_Features.stats
Cd69-TDLN-rep1_S1_L001_Features.stats  Cd69-TDLN-rep2_S2_L001_barcodes.tsv    Cd69-TDLN-rep2_S2_L001_Summary.csv   WT-TDLN-rep1_S1_L001_matrix.mtx    WT-TDLN-rep2_S2_L001_features.tsv
Cd69-TDLN-rep1_S1_L001_features.tsv    Cd69-TDLN-rep2_S2_L001_Features.stats  WT-TDLN-rep1_S1_L001_barcodes.tsv    WT-TDLN-rep1_S1_L001_Summary.csv   WT-TDLN-rep2_S2_L001_matrix.mtx
Cd69-TDLN-rep1_S1_L001_matrix.mtx      Cd69-TDLN-rep2_S2_L001_features.tsv    WT-TDLN-rep1_S1_L001_Features.stats  WT-TDLN-rep2_S2_L001_barcodes.tsv  WT-TDLN-rep2_S2_L001_Summary.csv

This is pretty much all the bioinformatics heavy lifting that has to be done.

Downstream analysis with R and the Seurat package.

You can then download the data locally or use a visual node on your local compute cluster provider.

library(Seurat)
library(Matrix)
library(DoubletFinder)
library(cowplot)
library(dplyr)
library(reshape2)
library(ggplot2)
library(harmony)
library(reticulate)
library(rafalib)
library(clustree)
library(edgeR)
library(MAST)
library(ggpubr)
library(enrichR)
library(msigdbr)
library(fgsea)
library(DropletUtils)
library(org.Mm.eg.db)
library(data.table)
setwd("~/Projects/scrna_ex/")

Part 1 - Load data into R

knitr::opts_knit$set(root.dir = '~/Projects/scrna_ex/')

# load annotations. We'll change ENSEMBL IDs with gene name ids.
# geneInfo.tab comes from the same folder used for STAR genome indexing
gene_info = data.frame(fread("./data/geneInfo.tab", sep="\t", skip=1, header=F))

mtx_files = list.files("./output/star/", pattern="*.mtx")
barcodes_files = list.files("./output/star/", pattern="*_barcodes.tsv")
features_files = list.files("./output/star/", pattern="*_features.tsv")

for(i in 1:length(mtx_files)){
  # define file paths
  mtx_file = mtx_files[i]
  base = basename(mtx_file)
  base = gsub("_matrix.mtx", "", base)
  barcodes_file = file.path("./output/star/", paste0(base, "_barcodes.tsv"))
  features_file = file.path("./output/star/", paste0(base, "_features.tsv"))

  mtx_file = paste0("./output/star/", mtx_file)
  
  # read files
  df = readMM(mtx_file)
  barcodes = read.table(barcodes_file, header=F)$V1
  features = read.table(features_file, header=F)$V1
  
  # Here we assume that gene_info is the exact same order as features
  if(identical(features, gene_info$V1) == TRUE){
    features = gene_info$V2
  }else{
    break("features and gene_info are not identical.")
  }
  
  row.names(df) = features
  colnames(df) = barcodes
  # Here for the sake of this tutorial, we'll downsize the matrices a little bit
  # We'll keep only the first 2000 cells. Of course in real life, we would not 
  # do this. However, we'd need a machine with more RAM to perform the downstream 
  # analyses.
  df = df[,1:1500]
  
  # Here let's write the files in .h5 format for better portability. 
  write10xCounts(
    paste0("./output/star/", base, ".h5"),
    df,
    #barcodes = colnames(x),
    #gene.id = rownames(x),
    #gene.symbol = gene.id,
    gene.type = "Gene Expression",
    overwrite = TRUE,
    type = "HDF5",
    genome = "mm10",
    version = c("3"),
    chemistry = "Single Cell 3' v3",
    original.gem.groups = 1L,
    library.ids = "custom"
  )
  df = NULL
  df1 = NULL
  gc()
}

## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.

## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.

## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.

## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.

## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.

## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.

## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.
## The 'cset' argument has been deprecated.
## Please use the argument 'encoding' instead.

## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.
## You created a large dataset with compression and chunking.
## The chunk size is equal to the dataset dimensions.
## If you want to read subsets of the dataset, you should testsmaller chunk sizes to improve read times.

# Load h5 files
cd69_rep1 <- Seurat::Read10X_h5(filename = "./output/star/Cd69-TDLN-rep1_S1_L001.h5", use.names = T)
cd69_rep2 <- Seurat::Read10X_h5(filename = "./output/star/Cd69-TDLN-rep2_S2_L001.h5", use.names = T)
wt_rep1 <- Seurat::Read10X_h5(filename = "./output/star/WT-TDLN-rep1_S1_L001.h5", use.names = T)
wt_rep2 <- Seurat::Read10X_h5(filename = "./output/star/WT-TDLN-rep2_S2_L001.h5", use.names = T)

# create Seurat objects
sdata.cd69_rep1 <- CreateSeuratObject(cd69_rep1, project = "cd69_1")
sdata.cd69_rep2 <- CreateSeuratObject(cd69_rep2, project = "cd69_2")
sdata.wt_rep1 <- CreateSeuratObject(wt_rep1, project = "wt_1")
sdata.wt_rep2 <- CreateSeuratObject(wt_rep2, project = "wt_2")

# create metadata
sdata.cd69_rep1$type = "cd69"
sdata.cd69_rep2$type = "cd69"
sdata.wt_rep1$type = "wt"
sdata.wt_rep2$type = "wt"

# Then merge as one single object:
alldata <- merge(sdata.cd69_rep1,
  c(
    sdata.cd69_rep2,
    sdata.wt_rep1,
    sdata.wt_rep2
  ), 
  add.cell.ids = c("cd69_rep1", "cd69_rep2", "wt_rep1", "wt_rep2")
)

# Remove unused objects and run garbage collector:
rm(sdata.cd69_rep1, sdata.cd69_rep2, sdata.wt_rep1, sdata.wt_rep2)
gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  9500994 507.5   16402761  876.1  11755314  627.9
## Vcells 88253680 673.4  139586585 1065.0 132911636 1014.1

#Lets's have a look at our newly created Seurat object.
as.data.frame(alldata@assays$RNA@counts[1:10, 1:2])

##               cd69_rep1_AAACCCAAGTCCCGAC cd69_rep1_AAACCCACAACCCGCA
## 4933401J01Rik                          0                          0
## Gm26206                                0                          0
## Xkr4                                   0                          0
## Gm18956                                0                          0
## Gm37180                                0                          0
## Gm37363                                0                          0
## Gm37686                                0                          0
## Gm1992                                 0                          0
## Gm37329                                0                          0
## Gm7341                                 0                          0

head(alldata@meta.data, 10)

##                            orig.ident nCount_RNA nFeature_RNA type
## cd69_rep1_AAACCCAAGTCCCGAC     cd69_1       6082         2415 cd69
## cd69_rep1_AAACCCACAACCCGCA     cd69_1       8883         2963 cd69
## cd69_rep1_AAACCCACAGAGATGC     cd69_1       9820         2784 cd69
## cd69_rep1_AAACCCACATCTATCT     cd69_1       5658         2287 cd69
## cd69_rep1_AAACCCAGTGGAACAC     cd69_1       4111         1734 cd69
## cd69_rep1_AAACCCAGTTAGAGAT     cd69_1       4399         1790 cd69
## cd69_rep1_AAACCCATCAGAACCT     cd69_1       8236         2491 cd69
## cd69_rep1_AAACCCATCATTTGCT     cd69_1      18727         4598 cd69
## cd69_rep1_AAACCCATCCAAGCAT     cd69_1       3293         1488 cd69
## cd69_rep1_AAACCCATCGACGAGA     cd69_1       4593         1761 cd69

str(alldata)

## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
##   .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:15828397] 32 44 68 70 123 125 160 169 202 318 ...
##   .. .. .. .. .. ..@ p       : int [1:6001] 0 2415 5378 8162 10449 12183 13973 16464 21062 22550 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 48440 6000
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:48440] "4933401J01Rik" "Gm26206" "Xkr4" "Gm18956" ...
##   .. .. .. .. .. .. ..$ : chr [1:6000] "cd69_rep1_AAACCCAAGTCCCGAC" "cd69_rep1_AAACCCACAACCCGCA" "cd69_rep1_AAACCCACAGAGATGC" "cd69_rep1_AAACCCACATCTATCT" ...
##   .. .. .. .. .. ..@ x       : num [1:15828397] 2 2 1 1 1 1 1 1 7 2 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:15828397] 32 44 68 70 123 125 160 169 202 318 ...
##   .. .. .. .. .. ..@ p       : int [1:6001] 0 2415 5378 8162 10449 12183 13973 16464 21062 22550 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 48440 6000
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:48440] "4933401J01Rik" "Gm26206" "Xkr4" "Gm18956" ...
##   .. .. .. .. .. .. ..$ : chr [1:6000] "cd69_rep1_AAACCCAAGTCCCGAC" "cd69_rep1_AAACCCACAACCCGCA" "cd69_rep1_AAACCCACAGAGATGC" "cd69_rep1_AAACCCACATCTATCT" ...
##   .. .. .. .. .. ..@ x       : num [1:15828397] 2 2 1 1 1 1 1 1 7 2 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ scale.data   : num[0 , 0 ] 
##   .. .. .. ..@ key          : chr "rna_"
##   .. .. .. ..@ assay.orig   : NULL
##   .. .. .. ..@ var.features : logi(0) 
##   .. .. .. ..@ meta.features:'data.frame':   48440 obs. of  0 variables
##   .. .. .. ..@ misc         : list()
##   ..@ meta.data   :'data.frame': 6000 obs. of  4 variables:
##   .. ..$ orig.ident  : chr [1:6000] "cd69_1" "cd69_1" "cd69_1" "cd69_1" ...
##   .. ..$ nCount_RNA  : num [1:6000] 6082 8883 9820 5658 4111 ...
##   .. ..$ nFeature_RNA: int [1:6000] 2415 2963 2784 2287 1734 1790 2491 4598 1488 1761 ...
##   .. ..$ type        : chr [1:6000] "cd69" "cd69" "cd69" "cd69" ...
##   ..@ active.assay: chr "RNA"
##   ..@ active.ident: Factor w/ 4 levels "cd69_1","cd69_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:6000] "cd69_rep1_AAACCCAAGTCCCGAC" "cd69_rep1_AAACCCACAACCCGCA" "cd69_rep1_AAACCCACAGAGATGC" "cd69_rep1_AAACCCACATCTATCT" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "SeuratProject"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 4 1 3
##   ..@ commands    : list()
##   ..@ tools       : list()

# Save data
dir.create("results", showWarnings = F)
saveRDS(alldata, "./results/seurat_part1.rds")

scRNA tutorial

Julien Tremblay

2023-04-19

Background

Part 0 - Setting up the data and running the bioinformatics workflow.

Download the fastqs

Downstream analysis with R and the Seurat package.

Part 1 - Load data into R