Single Cell RNA Seq Which Read Is My Barcoded Read
Contents
- 1 Introduction
- 2 Reading in 10X Genomics data
- 2.1 From the UMI count matrix
- 2.2 From the molecule information file
- 3 Downsampling on the reads
- 4 Computing barcode ranks
- 5 Detecting empty droplets
- 6 Demultiplexing hashed libraries
- 7 Removing swapping effects
- 7.1 Barcode swapping between samples
- 7.2 Chimeric reads within cells
- 8 Session information
Introduction
Droplet-based single-cell RNA sequencing (scRNA-seq) technologies allow researchers to obtain transcriptome-wide expression profiles for thousands of cells at once. Briefly, each cell is encapsulated in a droplet in an oil-water emulsion, along with a bead containing reverse transcription primers with a unique barcode sequence. After reverse transcription inside the droplet, each cell's cDNA is labelled with that barcode (referred to as a "cell barcode"). Bursting of the droplets yields a pool of cDNA for library preparation and sequencing. Debarcoding of the sequences can then be performed to obtain the expression profile for each cell.
The DropletUtils package implements some general utilities for handling these data after quantification of expression. In particular, we focus on the 10X Genomics platform, providing functions to load in the matrix of unique molecule identifier (UMI) counts as well as the raw molecule information. Functions are also available for downsampling the UMI count matrix or the raw reads; for distinguishing cells from empty droplets, based on the UMI counts; and for eliminating the effects of barcode swapping on Illumina 4000 sequencing machines.
Reading in 10X Genomics data
From the UMI count matrix
The CellRanger pipeline from 10X Genomics will process the raw sequencing data and produce a matrix of UMI counts. Each row of this matrix corresponds to a gene, while each column corresponds to a cell barcode. This is saved in a single directory for each sample, usually named like <OUTPUT>/outs/filtered_gene_bc_matrices/<GENOME>. (If you use the "filtered" matrix, each column corresponds to a putative cell. If you use the "raw" matrix, all barcodes are loaded, and no distinction is made between cells and empty droplets.) We mock up an example directory below using some simulated data:
library(DropletUtils)

# To generate the files.
example(write10xCounts, echo=FALSE)
dir.name <- tmpdir
list.files(dir.name)
## [1] "barcodes.tsv" "genes.tsv" "matrix.mtx"
The matrix.mtx file contains the UMI counts, while the other two files contain the cell barcodes and the gene annotation. We can load this into memory using the read10xCounts function, which returns a SingleCellExperiment object containing all of the relevant information. This includes the barcode sequence for each cell (column), as well as the identifier and symbol for each gene (row).
sce <- read10xCounts(dir.name)
sce
## class: SingleCellExperiment
## dim: 100 10
## metadata(1): Samples
## assays(1): counts
## rownames(100): ENSG00001 ENSG00002 ... ENSG000099 ENSG0000100
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
The counts themselves are loaded as a sparse matrix, specifically a dgCMatrix from the Matrix package. This reduces memory usage by only storing the non-zero counts, which is useful for sparse scRNA-seq data with lots of dropouts.
class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
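To make the memory argument concrete, here is a minimal sketch that uses only the Matrix package. The toy matrix, its dimensions and its roughly 5% density are assumptions chosen for illustration; they are not taken from the example above.

```r
library(Matrix)

set.seed(42)
# Hypothetical toy count matrix: 1000 genes x 100 barcodes, about 5% non-zero.
dense <- matrix(0, nrow=1000, ncol=100)
nonzero <- sample(length(dense), 5000)
dense[nonzero] <- rpois(5000, lambda=3) + 1

# Convert to column-compressed sparse format; only non-zero entries are stored.
sparse <- as(dense, "CsparseMatrix")
class(sparse)

# Compare the memory footprint of the two representations.
dense.size <- as.numeric(object.size(dense))
sparse.size <- as.numeric(object.size(sparse))
sparse.size / dense.size
```

The ratio printed at the end should be well below 1 for data this sparse; the saving shrinks as the proportion of non-zero entries grows.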
Users can also load multiple samples at once by supplying a character vector to read10xCounts. This will return a single SingleCellExperiment where all of the individual matrices are combined by column. Obviously, this only makes sense when the same set of genes is being used across samples.
From the molecule information file
CellRanger will also produce a molecule information file (molecule_info.h5) that contains… well, information about the transcript molecules. This includes the UMI sequence, the cell barcode sequence, the gene to which it was assigned, and the number of reads covering the molecule. (For readers who are unfamiliar with UMIs, they allow reads from different PCR amplicons to be unambiguously assigned to the same original molecule.) For demonstration purposes, we create an example molecule information file below:
set.seed(1000)
mol.info.file <- DropletUtils:::simBasicMolInfo(tempfile())
mol.info.file
## [1] "/tmp/Rtmpsd303I/file8dc234977af1a"
We can subsequently load this information into our R session using the read10xMolInfo function:
mol.info <- read10xMolInfo(mol.info.file)
mol.info
## $data
## DataFrame with 9532 rows and 5 columns
##             cell       umi gem_group      gene     reads
##      <character> <integer> <integer> <integer> <integer>
## 1           TGTT     80506         1        18         8
## 2           CAAT    722585         1        20         6
## 3           AGGG    233634         1         4         6
## 4           TCCC    516870         1        10         9
## 5           ATAG    887407         1         6        12
## ...          ...       ...       ...       ...       ...
## 9528        TACT   1043995         1         9        12
## 9529        GCTG    907401         1        20        13
## 9530        ATTA    255710         1        13        10
## 9531        GCAC    672962         1        20        11
## 9532        TGAA    482852         1         1         6
##
## $genes
##  [1] "ENSG1"  "ENSG2"  "ENSG3"  "ENSG4"  "ENSG5"  "ENSG6"  "ENSG7"  "ENSG8"
##  [9] "ENSG9"  "ENSG10" "ENSG11" "ENSG12" "ENSG13" "ENSG14" "ENSG15" "ENSG16"
## [17] "ENSG17" "ENSG18" "ENSG19" "ENSG20"
This information can be useful for quality control purposes, especially when the underlying read counts are required, e.g., to investigate sequencing saturation. Note that the function will automatically guess the length of the barcode sequence, as this is not formally defined in the molecule information file. For most experiments, the guess is correct, but users can force the function to use a known barcode length with the barcode.length argument.
Downsampling on the reads
Given multiple batches of very different sequencing depths, it can be beneficial to downsample the deepest batches to match the coverage of the shallowest batches. This avoids differences in technical noise that can drive clustering by batch. The scuttle package provides some utilities to downsample count matrices, but technically speaking, downsampling on the reads is more appropriate as it recapitulates the effect of differences in sequencing depth per cell. This can be achieved by applying the downsampleReads function to the molecule information file containing the read counts:
set.seed(100)
no.sampling <- downsampleReads(mol.info.file, prop=1)
sum(no.sampling)
## [1] 9532
with.sampling <- downsampleReads(mol.info.file, prop=0.5)
sum(with.sampling)
## [1] 9457
The above code will downsample the reads to 50% of the original coverage across the experiment. However, the function will return a matrix of UMI counts, so the final total count may not actually decrease if the libraries are sequenced to saturation! Users should use downsampleMatrix() instead if they want to guarantee similar total counts after downsampling.
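The saturation effect can be illustrated with a small base R sketch. It uses binomial thinning of per-molecule read counts as a simplified stand-in for read downsampling; this is an assumption made for illustration and is not necessarily the sampling scheme used by downsampleReads. The read depth of about 11 reads per molecule is also hypothetical.

```r
set.seed(100)
# Hypothetical read counts per molecule in a deeply sequenced library.
reads.per.molecule <- rpois(1000, lambda=10) + 1L

# Binomial thinning: each read is kept independently with probability 'prop'.
prop <- 0.5
down <- rbinom(length(reads.per.molecule), size=reads.per.molecule, prob=prop)

# A molecule (UMI) is still counted if at least one of its reads survives.
umi.before <- length(reads.per.molecule)
umi.after <- sum(down > 0)
c(reads.before=sum(reads.per.molecule), reads.after=sum(down),
  umi.before=umi.before, umi.after=umi.after)
```

The number of reads roughly halves, while the number of surviving molecules barely changes because most molecules are covered by many reads. This mirrors the small drop from 9532 to 9457 in the output above.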
Computing barcode ranks
A useful diagnostic for droplet-based data is the barcode rank plot, which shows the (log-)total UMI count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful as it allows users to examine the distribution of total counts across barcodes, focusing on those with the largest counts. To demonstrate, let us mock up a count matrix:
set.seed(0)
my.counts <- DropletUtils:::simCounts()
We compute the statistics using the barcodeRanks function, and then create the plot as shown below.
br.out <- barcodeRanks(my.counts)

# Making a plot.
plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
o <- order(br.out$rank)
lines(br.out$rank[o], br.out$fitted[o], col="red")

# Marking the knee and inflection points.
abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"),
    legend=c("knee", "inflection"))
The knee and inflection points on the curve mark the transition between two components of the total count distribution. This is assumed to represent the difference between empty droplets with little RNA and cell-containing droplets with much more RNA, though a more rigorous method for distinguishing between these two possibilities is discussed below.
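The idea behind the rank plot can be sketched in base R. The totals below are hypothetical, and the "largest drop" threshold is a deliberately crude stand-in for the transition between the two components; it is not the knee or inflection calculation performed by barcodeRanks.

```r
# Hypothetical total UMI counts: 200 low-count barcodes and 20 high-count ones.
totals <- c(rep(c(1, 2, 3, 5, 8), each=40), seq(1500, 2450, by=50))

# Rank barcodes by decreasing total; tied barcodes share their average rank.
ranks <- rank(-totals, ties.method="average")

# One point per distinct total, ordered from largest to smallest.
u <- sort(unique(totals), decreasing=TRUE)

# Crude threshold: the largest drop in log-total between adjacent distinct values.
drops <- -diff(log10(u))
threshold <- u[which.max(drops)]
threshold

# Barcodes at or above the threshold form the high-count component.
sum(totals >= threshold)
```

On real data the two components overlap and the curve is noisy, which is why a fitted curve and a formal test are preferable to a simple gap rule like this one.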
Detecting empty droplets
Empty droplets often contain RNA from the ambient solution, resulting in non-zero counts after debarcoding. The emptyDrops function is designed to distinguish between empty droplets and cells. It does so by testing each barcode's expression profile for significant deviation from the ambient profile. Given a matrix my.counts containing UMI counts for all barcodes, we call:
set.seed(100)
e.out <- emptyDrops(my.counts)
e.out
## DataFrame with 11100 rows and 5 columns
##           Total   LogProb    PValue   Limited        FDR
##       <integer> <numeric> <numeric> <logical>  <numeric>
## 1             2        NA        NA        NA         NA
## 2             9        NA        NA        NA         NA
## 3            20        NA        NA        NA         NA
## 4            20        NA        NA        NA         NA
## 5             1        NA        NA        NA         NA
## ...         ...       ...       ...       ...        ...
## 11096       215  -246.428 9.999e-05      TRUE 0.00013799
## 11097       201  -250.234 9.999e-05      TRUE 0.00013799
## 11098       247  -275.905 9.999e-05      TRUE 0.00013799
## 11099       191  -228.763 9.999e-05      TRUE 0.00013799
## 11100       198  -233.043 9.999e-05      TRUE 0.00013799
Droplets with significant deviations from the ambient profile are detected at a specified FDR threshold, e.g., with FDR below 1%. These can be considered to be cell-containing droplets, with a frequency of false positives (i.e., empty droplets) at the specified FDR. Furthermore, droplets with very large counts are automatically retained by setting their p-values to zero. This avoids discarding droplets containing cells that are very similar to the ambient profile.
is.cell <- e.out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)
## [1] 943
The p-values are calculated by permutation testing, hence the need to set a seed. The Limited field indicates whether a lower p-value could be obtained by increasing the number of permutations. If there are any entries with FDR above the desired threshold and Limited==TRUE, it indicates that niters should be increased in the emptyDrops call.
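The reason a simulation-based p-value has a floor can be shown with a short base R sketch. It assumes the usual "+1" correction for Monte Carlo p-values and uses normally distributed stand-ins for the simulated log-probabilities; both are assumptions for illustration rather than a description of the internals of emptyDrops.

```r
# Monte Carlo p-value with the usual +1 correction, so it can never be zero.
mc.pvalue <- function(observed, simulated) {
    (sum(simulated <= observed) + 1) / (length(simulated) + 1)
}

set.seed(100)
niters <- 10000
# Hypothetical null log-probabilities; stand-ins for simulated ambient profiles.
simulated <- rnorm(niters, mean=-50, sd=5)

# An observed value far below every simulated one hits the lower bound.
p <- mc.pvalue(observed=-250, simulated=simulated)
lower.bound <- 1 / (niters + 1)
c(p=p, lower.bound=lower.bound, limited=(p <= lower.bound))
```

With 10000 iterations the smallest attainable p-value is 1/10001, which agrees with the value of 9.999e-05 reported for the barcodes marked as limited in the output above. Only more iterations can push it lower.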
table(Limited=e.out$Limited, Significant=is.cell)
##        Significant
## Limited FALSE TRUE
##   FALSE   357  843
##   TRUE      0  100
We recommend making some diagnostic plots such as the total count against the negative log-probability. Droplets detected as cells should show up with large negative log-probabilities or very large total counts (based on the knee point reported by barcodeRanks). Note that the example below is based on simulated data and is quite exaggerated.
plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
    xlab="Total UMI count", ylab="-Log Probability")
Demultiplexing hashed libraries
Cell hashing experiments can be demultiplexed using the hashedDrops() function on the set of cell-containing barcode libraries. To demonstrate, we will mock up some hash tag oligo (HTO) counts for a population with cells from each of 10 samples. We will also add some doublets and empty droplets for some flavor:
set.seed(10000)

# Simulating empty droplets:
nbarcodes <- 1000
nhto <- 10
y <- matrix(rpois(nbarcodes*nhto, 20), nrow=nhto)

# Simulating cells:
ncells <- 100
true.sample <- sample(nhto, ncells, replace=TRUE)
y[cbind(true.sample, seq_len(ncells))] <- 1000

# Simulating doublets:
ndoub <- ncells/10
next.sample <- (true.sample[1:ndoub] + 1) %% nrow(y)
next.sample[next.sample==0] <- nrow(y)
y[cbind(next.sample, seq_len(ndoub))] <- 500
Our first task is to identify the barcodes that actually contain cells. If we already did the calling with emptyDrops(), we could just re-use those calls; otherwise we can obtain calls directly from the HTO count matrix, though this requires some fiddling with lower= to match the sequencing depth of the HTO library.
hto.calls <- emptyDrops(y, lower=500)
has.cell <- hto.calls$FDR <= 0.001
summary(has.cell)
##    Mode    TRUE    NA's
## logical     100     900
Each cell-containing barcode library is simply assigned to the sample of origin based on its most abundant HTO. The confidence of the assignment is quantified by the log-fold change between the top and second-most abundant HTOs. The function will automatically adjust for differences in the ambient levels of each HTO based on the ambient profile; if this is not provided, it is roughly estimated from the supplied count matrix.
demux <- hashedDrops(y[,which(has.cell)],
    ambient=metadata(hto.calls)$ambient)
demux
## DataFrame with 100 rows and 7 columns
##         Total      Best    Second     LogFC    LogFC2   Doublet Confident
##     <numeric> <integer> <integer> <numeric> <numeric> <logical> <logical>
## 1        1657         4         5  0.999462   4.60496      TRUE     FALSE
## 2        1635         8         9  0.999492   4.84165      TRUE     FALSE
## 3        1669         6         7  0.999473   4.45073      TRUE     FALSE
## 4        1674         6         7  0.999491   4.49983      TRUE     FALSE
## 5        1645         3         4  1.000292   4.74602      TRUE     FALSE
## ...       ...       ...       ...       ...       ...       ...       ...
## 96       1167         3         1   5.31708  0.427468     FALSE      TRUE
## 97       1158         3         1   5.26081  0.526363     FALSE      TRUE
## 98       1179         4         9   5.00121  0.604380     FALSE      TRUE
## 99       1187         2         5   5.37410  0.196833     FALSE      TRUE
## 100      1177         5         8   5.15739  0.464633     FALSE      TRUE
It is then a simple matter to determine the sample of origin for each cell. We provide Confident calls to indicate which cells are confident singlets, based on whether they (i) are not doublets and (ii) do not have small log-fold changes between the top and second HTO. The definition of "small" is relative and can be changed with the nmad= argument.
table(demux$Best[demux$Confident])
##
##  1  2  3  4  5  6  7  8  9 10
## 10 15  9  7 12  8  6  6 10  6
We also identify doublets based on the log-fold change between the second HTO's abundance and the ambient contamination. A large log-fold change indicates that the second HTO exceeds that from contamination, consistent with the presence of a doublet.
colors <- ifelse(demux$Confident, "black",
    ifelse(demux$Doublet, "red", "grey"))
plot(demux$LogFC, demux$LogFC2, col=colors,
    xlab="Log-fold change between best and second HTO",
    ylab="Log-fold change between second HTO and ambient")
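The core of the assignment logic described in this section can be sketched in base R. The toy HTO counts and the pseudo-count of 5 are assumptions for illustration, and the sketch ignores the ambient adjustment that hashedDrops() performs, so it should be read as a simplified outline rather than the function's actual implementation.

```r
# Hypothetical HTO counts: 3 HTOs (rows) by 4 barcodes (columns).
hto <- matrix(c(1000,   20,   25,
                  15,  900,   30,
                1000,  500,   20,
                  18,   22, 1100),
              nrow=3,
              dimnames=list(paste0("HTO", 1:3), paste0("cell", 1:4)))

pseudo <- 5  # assumed pseudo-count to stabilise the ratios

# Most abundant and second-most abundant HTO for each barcode.
ord <- apply(hto, 2, order, decreasing=TRUE)
best <- ord[1, ]
second <- ord[2, ]

top.count <- hto[cbind(best, seq_len(ncol(hto)))]
second.count <- hto[cbind(second, seq_len(ncol(hto)))]

# Log2-fold change between the top two HTOs; small values suggest ambiguity.
logfc <- log2((top.count + pseudo) / (second.count + pseudo))
data.frame(best, second, logfc=round(logfc, 2))
```

The third barcode carries two abundant HTOs (1000 and 500), so its log-fold change is close to 1, much like the doublets at the top of the hashedDrops() output above. The other three barcodes have log-fold changes of roughly 5.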
Removing swapping effects
Barcode swapping between samples
Barcode swapping is a phenomenon that occurs upon multiplexing samples on the Illumina 4000 sequencer. Molecules from one sample are incorrectly labelled with sample barcodes from another sample, resulting in their misassignment upon demultiplexing. Fortunately, droplet experiments provide a unique opportunity to eliminate this effect, by assuming that it is effectively impossible to generate multiple molecules with the same combination of cell barcode, assigned gene and UMI sequence. Thus, any molecules with the same combination across multiple samples are likely to arise from barcode swapping.
The swappedDrops function will identify overlapping combinations in the molecule information files of all multiplexed 10X samples sequenced on the same run. It will remove these combinations and return "cleaned" UMI count matrices for all samples to use in downstream analyses. To demonstrate, we mock up a set of molecule information files for three multiplexed 10X samples:
set.seed(1000)
mult.mol.info <- DropletUtils:::simSwappedMolInfo(tempfile(), nsamples=3)
mult.mol.info
## [1] "/tmp/Rtmpsd303I/file8dc23681ecd81.1.h5"
## [2] "/tmp/Rtmpsd303I/file8dc23681ecd81.2.h5"
## [3] "/tmp/Rtmpsd303I/file8dc23681ecd81.3.h5"
We then apply swappedDrops to these files to remove the effect of swapping in our count matrices.
s.out <- swappedDrops(mult.mol.info, min.frac=0.9)
length(s.out$cleaned)
## [1] 3
class(s.out$cleaned[[1]])
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
For combinations where 90% of the reads belong to a single sample, the molecule is assigned to that sample rather than being removed. This assumes that swapping is relatively rare, so that the read count should be highest in the sample of origin. The exact percentage can be tuned by altering min.frac in the swappedDrops call.
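The rule described above can be sketched with a small base R data frame. The molecule table and its column names are hypothetical, and the sketch only mimics the stated rule (keep a shared combination in the dominant sample when its read fraction reaches min.frac, otherwise drop it everywhere); it does not reproduce the internals of swappedDrops.

```r
# Hypothetical molecules pooled from three multiplexed samples (A, B, C).
mols <- data.frame(
    sample = c("A", "B", "A", "B", "C"),
    cell   = c("AAAC", "AAAC", "GGGT", "GGGT", "TTTG"),
    gene   = c(1L, 1L, 2L, 2L, 3L),
    umi    = c(101L, 101L, 202L, 202L, 303L),
    reads  = c(95L, 5L, 6L, 4L, 10L),
    stringsAsFactors = FALSE
)
min.frac <- 0.9

# Molecules sharing cell barcode, gene and UMI across samples are suspect.
key <- paste(mols$cell, mols$gene, mols$umi, sep="_")
total <- ave(mols$reads, key, FUN=sum)
top <- ave(mols$reads, key, FUN=max)

# Keep a molecule only in the sample holding at least 'min.frac' of its reads.
keep <- mols$reads == top & mols$reads / total >= min.frac
cleaned <- mols[keep, ]
cleaned
```

Here the first combination is retained in sample A (95% of its reads), the second is discarded from both samples (60% versus 40%), and the molecule unique to sample C is left alone. Because min.frac exceeds 0.5, at most one sample can satisfy the condition for any combination.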
Chimeric reads within cells
On occasion, chimeric molecules are generated during library preparation where incomplete PCR products from one cDNA molecule hybridise to another molecule for extension using shared sequences like the poly-A tail for 3' protocols. This produces an amplicon where the UMI and cell barcode originate from one transcript molecule but the gene sequence is from another, equivalent to swapping of reads between genes. We handle this effect by removing all molecules in the same cell with the same UMI sequence using the chimericDrops() function. This is applied below to a molecule information file to obtain a single cleaned count matrix for the relevant sample.
out <- chimericDrops(mult.mol.info[1])
class(out)
## [1] "list"
Of course, this may also remove non-chimeric molecules that have the same UMI by chance, but for typical UMI lengths (10-12 bp for 10X protocols) we expect UMI collisions to be very rare between molecules from the same cell. Nonetheless, to mitigate losses due to collisions, we retain any molecule that has a much greater number of reads compared to all other molecules with the same UMI in the same cell.
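A back-of-the-envelope calculation gives a feel for how rare such collisions are. The sketch below assumes that UMIs are drawn uniformly at random from all 4^L possible sequences and that a cell contributes 5000 molecules; both are simplifying assumptions for illustration, as real UMI usage is not perfectly uniform.

```r
# Probability that a given molecule shares its UMI with at least one of the
# other (n - 1) molecules in the same cell, assuming uniformly random UMIs.
collision.prob <- function(n.molecules, umi.length) {
    n.umis <- 4^umi.length
    1 - (1 - 1/n.umis)^(n.molecules - 1)
}

n <- 5000  # hypothetical number of molecules captured from one cell
p10 <- collision.prob(n, 10)
p12 <- collision.prob(n, 12)
c(umi10=p10, umi12=p12)
```

Under these assumptions, fewer than 1% of molecules would be affected with a 10 bp UMI, and the proportion is smaller still with a 12 bp UMI.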
Session information
sessionInfo()
## R Under development (unstable) (2021-10-19 r81077)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] Matrix_1.3-4                DropletUtils_1.15.2
##  [3] SingleCellExperiment_1.17.1 SummarizedExperiment_1.25.2
##  [5] Biobase_2.55.0              GenomicRanges_1.47.3
##  [7] GenomeInfoDb_1.31.1         IRanges_2.29.0
##  [9] S4Vectors_0.33.2            BiocGenerics_0.41.1
## [11] MatrixGenerics_1.7.0        matrixStats_0.61.0
## [13] knitr_1.36                  BiocStyle_2.23.0
##
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.4            xfun_0.28
##  [3] bslib_0.3.1               beachmat_2.11.0
##  [5] HDF5Array_1.23.0          lattice_0.20-45
##  [7] rhdf5_2.39.0              htmltools_0.5.2
##  [9] yaml_2.2.1                rlang_0.4.12
## [11] R.oo_1.24.0               jquerylib_0.1.4
## [13] scuttle_1.5.0             R.utils_2.11.0
## [15] BiocParallel_1.29.1       dqrng_0.3.0
## [17] GenomeInfoDbData_1.2.7    stringr_1.4.0
## [19] zlibbioc_1.41.0           R.methodsS3_1.8.1
## [21] evaluate_0.14             fastmap_1.1.0
## [23] parallel_4.2.0            highr_0.9
## [25] Rcpp_1.0.7                edgeR_3.37.0
## [27] BiocManager_1.30.16       limma_3.51.0
## [29] DelayedArray_0.21.1       magick_2.7.3
## [31] jsonlite_1.7.2            XVector_0.35.0
## [33] digest_0.6.28             stringi_1.7.5
## [35] bookdown_0.24             grid_4.2.0
## [37] tools_4.2.0               bitops_1.0-7
## [39] rhdf5filters_1.7.0        magrittr_2.0.1
## [41] sass_0.4.0                RCurl_1.98-1.5
## [43] DelayedMatrixStats_1.17.0 sparseMatrixStats_1.7.0
## [45] rmarkdown_2.11            Rhdf5lib_1.17.0
## [47] R6_2.5.1                  compiler_4.2.0
Source: https://bioconductor.org/packages/devel/bioc/vignettes/DropletUtils/inst/doc/DropletUtils.html