Single-Cell RNA-Seq: Which Read Is My Barcoded Read

March 04, 2022

1 Introduction
2 Reading in 10X Genomics data
- 2.1 From the UMI count matrix
- 2.2 From the molecule information file
3 Downsampling on the reads
4 Computing barcode ranks
5 Detecting empty droplets
6 Demultiplexing hashed libraries
7 Removing swapping effects
- 7.1 Barcode swapping between samples
- 7.2 Chimeric reads within cells
8 Session information

Introduction

Droplet-based single-cell RNA sequencing (scRNA-seq) technologies allow researchers to obtain transcriptome-wide expression profiles for thousands of cells at once. Briefly, each cell is encapsulated in a droplet in an oil-water emulsion, along with a bead containing reverse transcription primers with a unique barcode sequence. After reverse transcription inside the droplet, each cell's cDNA is labelled with that barcode (referred to as a "cell barcode"). Bursting of the droplets yields a pool of cDNA for library preparation and sequencing. Debarcoding of the sequences can then be performed to obtain the expression profile for each cell.

This package implements some general utilities for handling these data after quantification of expression. In particular, we focus on the 10X Genomics platform, providing functions to load in the matrix of unique molecule identifier (UMI) counts as well as the raw molecule information. Functions are also available for downsampling the UMI count matrix or the raw reads; for distinguishing cells from empty droplets, based on the UMI counts; and for eliminating the effects of barcode swapping on Illumina 4000 sequencing machines.

Reading in 10X Genomics data

From the UMI count matrix

The CellRanger pipeline from 10X Genomics will process the raw sequencing data and produce a matrix of UMI counts. Each row of this matrix corresponds to a gene, while each column corresponds to a cell barcode. This is saved in a single directory for each sample, usually named like <OUTPUT>/outs/filtered_gene_bc_matrices/<GENOME>. (If you use the "filtered" matrix, each column corresponds to a putative cell. If you use the "raw" matrix, all barcodes are loaded, and no distinction is made between cells and empty droplets.) We mock up an example directory below using some simulated data:

    library(DropletUtils)

    # To generate the files.
    example(write10xCounts, echo=FALSE)
    dir.name <- tmpdir
    list.files(dir.name)

    ## [1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"

The matrix.mtx file contains the UMI counts, while the other two files contain the cell barcodes and the gene annotation. We can load this into memory using the read10xCounts function, which returns a SingleCellExperiment object containing all of the relevant information. This includes the barcode sequence for each cell (column), as well as the identifier and symbol for each gene (row).

    sce <- read10xCounts(dir.name)
    sce

    ## class: SingleCellExperiment 
    ## dim: 100 10 
    ## metadata(1): Samples
    ## assays(1): counts
    ## rownames(100): ENSG00001 ENSG00002 ... ENSG000099 ENSG0000100
    ## rowData names(2): ID Symbol
    ## colnames: NULL
    ## colData names(2): Sample Barcode
    ## reducedDimNames(0):
    ## mainExpName: NULL
    ## altExpNames(0):

The counts themselves are loaded as a sparse matrix, specifically a dgCMatrix from the Matrix package. This reduces memory usage by only storing the non-zero counts, which is useful for sparse scRNA-seq data with lots of dropouts.

    class(counts(sce))

    ## [1] "dgCMatrix"
    ## attr(,"package")
    ## [1] "Matrix"
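To give a rough sense of the memory saving from sparse storage, the sketch below compares a dense base R matrix with its sparse equivalent. The matrix dimensions and the Poisson rate are made-up values chosen only so that most entries are zero; they are not taken from any real dataset.

```r
# Illustrative only: dense versus sparse storage of a mostly-zero count matrix.
library(Matrix)

set.seed(1)
ngenes <- 1000   # hypothetical number of genes
ncells <- 500    # hypothetical number of barcodes

# Simulate counts where roughly 95% of the entries are zero.
vals <- rpois(ngenes * ncells, lambda = 0.05)
dense <- matrix(vals, nrow = ngenes, ncol = ncells)

# Matrix() with sparse=TRUE stores only the non-zero entries.
sparse <- Matrix(dense, sparse = TRUE)

class(sparse)
mean(dense == 0)   # fraction of zero entries

dense.bytes <- as.numeric(object.size(dense))
sparse.bytes <- as.numeric(object.size(sparse))
c(dense = dense.bytes, sparse = sparse.bytes)
```

The saving grows with the proportion of zeros, so it is usually larger for raw (unfiltered) matrices where most barcodes are empty droplets.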

Users can also load multiple samples at once by supplying a character vector to read10xCounts. This will return a single SingleCellExperiment where all of the individual matrices are combined by column. Obviously, this only makes sense when the same set of genes is being used across samples.

From the molecule information file

CellRanger will also produce a molecule information file (molecule_info.h5) that contains… well, information about the transcript molecules. This includes the UMI sequence, the cell barcode sequence, the gene to which it was assigned, and the number of reads covering the molecule. (For readers who are unfamiliar with UMIs, they allow reads from different PCR amplicons to be unambiguously assigned to the same original molecule.) For demonstration purposes, we create an example molecule information file below:

    set.seed(1000)
    mol.info.file <- DropletUtils:::simBasicMolInfo(tempfile())
    mol.info.file

    ## [1] "/tmp/Rtmpsd303I/file8dc234977af1a"

We can subsequently load this information into our R session using the read10xMolInfo function:

    mol.info <- read10xMolInfo(mol.info.file)
    mol.info

    ## $data
    ## DataFrame with 9532 rows and 5 columns
    ##             cell       umi gem_group      gene     reads
    ##      <character> <integer> <integer> <integer> <integer>
    ## 1           TGTT     80506         1        18         8
    ## 2           CAAT    722585         1        20         6
    ## 3           AGGG    233634         1         4         6
    ## 4           TCCC    516870         1        10         9
    ## 5           ATAG    887407         1         6        12
    ## ...          ...       ...       ...       ...       ...
    ## 9528        TACT   1043995         1         9        12
    ## 9529        GCTG    907401         1        20        13
    ## 9530        ATTA    255710         1        13        10
    ## 9531        GCAC    672962         1        20        11
    ## 9532        TGAA    482852         1         1         6
    ## 
    ## $genes
    ##  [1] "ENSG1"  "ENSG2"  "ENSG3"  "ENSG4"  "ENSG5"  "ENSG6"  "ENSG7"  "ENSG8" 
    ##  [9] "ENSG9"  "ENSG10" "ENSG11" "ENSG12" "ENSG13" "ENSG14" "ENSG15" "ENSG16"
    ## [17] "ENSG17" "ENSG18" "ENSG19" "ENSG20"

This information can be useful for quality control purposes, especially when the underlying read counts are required, e.g., to investigate sequencing saturation. Note that the function will automatically guess the length of the barcode sequence, as this is not formally defined in the molecule information file. For most experiments, the guess is correct, but users can force the function to use a known barcode length with the barcode.length argument.

Downsampling on the reads

Given multiple batches of very different sequencing depths, it can be beneficial to downsample the deepest batches to match the coverage of the shallowest batches. This avoids differences in technical noise that can drive clustering by batch. The scuttle package provides some utilities to downsample count matrices, but technically speaking, downsampling on the reads is more appropriate as it recapitulates the effect of differences in sequencing depth per cell. This can be achieved by applying the downsampleReads function to the molecule information file containing the read counts:

    set.seed(100)
    no.sampling <- downsampleReads(mol.info.file, prop=1)
    sum(no.sampling)

    ## [1] 9532

    with.sampling <- downsampleReads(mol.info.file, prop=0.5)
    sum(with.sampling)

    ## [1] 9457

The above code will downsample the reads to 50% of the original coverage across the experiment. However, the function will return a matrix of UMI counts, so the final total count may not actually decrease if the libraries are sequenced to saturation! Users should use downsampleMatrix() instead if they want to guarantee similar total counts after downsampling.
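The point about saturation can be illustrated with a small base R sketch. It thins the reads of each molecule with a binomial draw and then counts the molecules that still have at least one read. This is a simplified stand-in chosen for illustration, not necessarily the sampling scheme that downsampleReads uses, and the read counts and the helper name downsample_umis are hypothetical.

```r
# Simplified sketch of read-level downsampling via binomial thinning.
# This is an assumption for illustration, not the downsampleReads() internals.
set.seed(42)

# Hypothetical reads per molecule for a shallow and a saturated library.
shallow.reads <- rep(1L, 1000)   # every molecule was sequenced once
deep.reads <- rep(20L, 1000)     # every molecule was sequenced 20 times

downsample_umis <- function(reads, prop) {
    kept <- rbinom(length(reads), size = reads, prob = prop)
    sum(kept > 0)   # a molecule still counts as one UMI if any read survives
}

umi.shallow <- downsample_umis(shallow.reads, prop = 0.5)
umi.deep <- downsample_umis(deep.reads, prop = 0.5)
c(shallow = umi.shallow, deep = umi.deep)
```

In the shallow library roughly half of the molecules disappear, whereas in the saturated library almost every molecule keeps at least one read, so the UMI count barely changes.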

Computing barcode ranks

A useful diagnostic for droplet-based data is the barcode rank plot, which shows the (log-)total UMI count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful as it allows users to examine the distribution of total counts across barcodes, focusing on those with the largest counts. To demonstrate, let us mock up a count matrix:

    set.seed(0)
    my.counts <- DropletUtils:::simCounts()

We compute the statistics using the barcodeRanks function, and then create the plot as shown below.

    br.out <- barcodeRanks(my.counts)

    # Making a plot.
    plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
    o <- order(br.out$rank)
    lines(br.out$rank[o], br.out$fitted[o], col="red")

    abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
    abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
    legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"),
        legend=c("knee", "inflection"))

The knee and inflection points on the curve mark the transition between two components of the total count distribution. This is assumed to represent the difference between empty droplets with little RNA and cell-containing droplets with much more RNA, though a more rigorous method for distinguishing between these two possibilities is discussed below.
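For readers who want to see what goes into such a plot, the following base R sketch derives ranks from a vector of total counts. The totals are simulated with made-up parameters, and giving tied barcodes their average rank is an assumption made here for illustration; the knee and inflection estimation is not reproduced.

```r
# Minimal sketch of the quantities behind a barcode rank plot (base R only).
set.seed(0)

# Hypothetical totals: many low-count barcodes plus a few with large counts.
totals <- c(rpois(950, lambda = 5), rpois(50, lambda = 2000))

# Rank 1 is the barcode with the largest total; ties share an average rank.
rk <- rank(-totals, ties.method = "average")

o <- order(rk)
head(data.frame(rank = rk[o], total = totals[o]))

# The plot would then use log-transformed axes, for example:
# plot(rk, totals, log = "xy", xlab = "Rank", ylab = "Total")
```

With two well-separated components like these, the sorted totals drop sharply once the ranks move past the 50 large barcodes, which is the kind of transition the knee and inflection points are meant to capture.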

Detecting empty droplets

Empty droplets often contain RNA from the ambient solution, resulting in non-zero counts after debarcoding. The emptyDrops function is designed to distinguish between empty droplets and cells. It does so by testing each barcode's expression profile for significant deviation from the ambient profile. Given a matrix my.counts containing UMI counts for all barcodes, we call:

    set.seed(100)
    e.out <- emptyDrops(my.counts)
    e.out

    ## DataFrame with 11100 rows and 5 columns
    ##           Total   LogProb    PValue   Limited        FDR
    ##       <integer> <numeric> <numeric> <logical>  <numeric>
    ## 1             2        NA        NA        NA         NA
    ## 2             9        NA        NA        NA         NA
    ## 3            20        NA        NA        NA         NA
    ## 4            20        NA        NA        NA         NA
    ## 5             1        NA        NA        NA         NA
    ## ...         ...       ...       ...       ...        ...
    ## 11096       215  -246.428 9.999e-05      TRUE 0.00013799
    ## 11097       201  -250.234 9.999e-05      TRUE 0.00013799
    ## 11098       247  -275.905 9.999e-05      TRUE 0.00013799
    ## 11099       191  -228.763 9.999e-05      TRUE 0.00013799
    ## 11100       198  -233.043 9.999e-05      TRUE 0.00013799

Droplets with significant deviations from the ambient profile are detected at a specified FDR threshold, e.g., with FDR below 1%. These can be considered to be cell-containing droplets, with a frequency of false positives (i.e., empty droplets) at the specified FDR. Furthermore, droplets with very large counts are automatically retained by setting their p-values to zero. This avoids discarding droplets containing cells that are very similar to the ambient profile.

    is.cell <- e.out$FDR <= 0.01
    sum(is.cell, na.rm=TRUE)

    ## [1] 943

The p-values are calculated by permutation testing, hence the need to set a seed. The Limited field indicates whether a lower p-value could be obtained by increasing the number of permutations. If there are any entries with FDR above the desired threshold and Limited==TRUE, it indicates that niters should be increased in the emptyDrops call.

    table(Limited=e.out$Limited, Significant=is.cell)

    ##        Significant
    ## Limited FALSE TRUE
    ##   FALSE   357  843
    ##   TRUE      0  100
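To see why a p-value can be "limited", consider a generic Monte Carlo p-value: it can never fall below 1/(iterations + 1), because the observed value is always counted once. The sketch below assumes this common formula and 10000 iterations, which would be consistent with the 9.999e-05 values in the output above; the helper mc_pvalue and the simulated values are hypothetical and are not the package's internals.

```r
# Sketch of a Monte Carlo p-value with a lower bound set by the iteration count.
# The formula and the choice of 10000 iterations are assumptions for illustration.
mc_pvalue <- function(observed, simulated) {
    n.as.extreme <- sum(simulated <= observed)   # simulations at least as extreme
    (n.as.extreme + 1) / (length(simulated) + 1)
}

set.seed(100)
niters <- 10000
simulated <- rnorm(niters, mean = -50, sd = 5)   # stand-in for simulated log-probabilities

p.typical <- mc_pvalue(-50, simulated)    # an unremarkable barcode
p.extreme <- mc_pvalue(-250, simulated)   # far below every simulated value

# No simulation was as extreme, so the p-value sits at its lower bound.
limited <- p.extreme == 1 / (niters + 1)
c(typical = p.typical, extreme = p.extreme)
limited
```

A barcode stuck at this bound might have an even smaller p-value with more iterations, which only matters if its FDR is still above the chosen threshold.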

We recommend making some diagnostic plots such as the total count against the negative log-probability. Droplets detected as cells should show up with large negative log-probabilities or very large total counts (based on the knee point reported by barcodeRanks). Note that the example below is based on simulated data and is quite exaggerated.

    plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
        xlab="Total UMI count", ylab="-Log Probability")

Demultiplexing hashed libraries

Cell hashing experiments can be demultiplexed using the hashedDrops() function on the set of cell-containing barcode libraries. To demonstrate, we will mock up some hash tag oligo (HTO) counts for a population with cells from each of 10 samples. We will also add some doublets and empty droplets for some flavor:

    set.seed(10000)

    # Simulating empty droplets:
    nbarcodes <- 1000
    nhto <- 10
    y <- matrix(rpois(nbarcodes*nhto, 20), nrow=nhto)

    # Simulating cells:
    ncells <- 100
    true.sample <- sample(nhto, ncells, replace=TRUE)
    y[cbind(true.sample, seq_len(ncells))] <- 1000

    # Simulating doublets:
    ndoub <- ncells/10
    next.sample <- (true.sample[1:ndoub]  + 1) %% nrow(y)
    next.sample[next.sample==0] <- nrow(y)
    y[cbind(next.sample, seq_len(ndoub))] <- 500

Our first task is to identify the barcodes that actually contain cells. If we already did the calling with emptyDrops(), we could just re-use those calls; otherwise we can obtain calls directly from the HTO count matrix, though this requires some fiddling with lower= to match the sequencing depth of the HTO library.

    hto.calls <- emptyDrops(y, lower=500)
    has.cell <- hto.calls$FDR <= 0.001
    summary(has.cell)

    ##    Mode    TRUE    NA's 
    ## logical     100     900

Each cell-containing barcode library is simply assigned to the sample of origin based on its most abundant HTO. The confidence of the assignment is quantified by the log-fold change between the top and second-most abundant HTOs. The function will automatically adjust for differences in the ambient levels of each HTO based on the ambient profile; if this is not provided, the ambient profile is roughly estimated from the supplied count matrix.

    demux <- hashedDrops(y[,which(has.cell)],
        ambient=metadata(hto.calls)$ambient)
    demux

    ## DataFrame with 100 rows and 7 columns
    ##         Total      Best    Second     LogFC    LogFC2   Doublet Confident
    ##     <numeric> <integer> <integer> <numeric> <numeric> <logical> <logical>
    ## 1        1657         4         5  0.999462   4.60496      TRUE     FALSE
    ## 2        1635         8         9  0.999492   4.84165      TRUE     FALSE
    ## 3        1669         6         7  0.999473   4.45073      TRUE     FALSE
    ## 4        1674         6         7  0.999491   4.49983      TRUE     FALSE
    ## 5        1645         3         4  1.000292   4.74602      TRUE     FALSE
    ## ...       ...       ...       ...       ...       ...       ...       ...
    ## 96       1167         3         1   5.31708  0.427468     FALSE      TRUE
    ## 97       1158         3         1   5.26081  0.526363     FALSE      TRUE
    ## 98       1179         4         9   5.00121  0.604380     FALSE      TRUE
    ## 99       1187         2         5   5.37410  0.196833     FALSE      TRUE
    ## 100      1177         5         8   5.15739  0.464633     FALSE      TRUE

It is then a simple matter to determine the sample of origin for each cell. We provide Confident calls to indicate which cells are confident singlets, based on whether they (i) are not doublets and (ii) do not have small log-fold changes between the top and second HTO. The definition of "small" is relative and can be changed with the nmad= argument.

    table(demux$Best[demux$Confident])

    ## 
    ##  1  2  3  4  5  6  7  8  9 10 
    ## 10 15  9  7 12  8  6  6 10  6

We also identify doublets based on the log-fold change between the second HTO's abundance and the ambient contamination. A large log-fold change indicates that the second HTO exceeds that from contamination, consistent with the presence of a doublet.

    colors <- ifelse(demux$Confident, "black",
        ifelse(demux$Doublet, "red", "grey"))
    plot(demux$LogFC, demux$LogFC2, col=colors,
        xlab="Log-fold change between best and second HTO",
        ylab="Log-fold change between second HTO and ambient")
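The core of the assignment described in this section can be sketched in a few lines of base R. This version simply picks the most abundant HTO per barcode and reports the log-fold change to the runner-up. It omits the ambient adjustment entirely, and the toy counts, the pseudo-count of 5 and the helper assign_hto are hypothetical, so it should not be read as the hashedDrops() algorithm.

```r
# Simplified sketch of HTO-based sample assignment (no ambient adjustment).
# Rows are HTOs and columns are barcodes; all values are made up.
hto <- matrix(c(
    1000,  20,  15,   # cell1: clear singlet from sample 1
      18, 900,  25,   # cell2: clear singlet from sample 2
     800,  22, 600    # cell3: two abundant HTOs, doublet-like
), nrow = 3, dimnames = list(paste0("HTO", 1:3), paste0("cell", 1:3)))

pseudo <- 5   # hypothetical pseudo-count to stabilise the ratio

assign_hto <- function(counts) {
    counts <- unname(counts)
    o <- order(counts, decreasing = TRUE)
    c(best = o[1], second = o[2],
      logfc = log2((counts[o[1]] + pseudo) / (counts[o[2]] + pseudo)))
}

# One row per barcode: best HTO, second HTO, and log2-fold change between them.
calls <- t(apply(hto, 2, assign_hto))
calls
```

The first two barcodes have a large log-fold change and would be easy to assign, while the third has a small one because its second HTO is nearly as abundant as the first.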

Removing swapping effects

Barcode swapping between samples

Barcode swapping is a phenomenon that occurs upon multiplexing samples on the Illumina 4000 sequencer. Molecules from one sample are incorrectly labelled with sample barcodes from another sample, resulting in their misassignment upon demultiplexing. Fortunately, droplet experiments provide a unique opportunity to eliminate this effect, by assuming that it is effectively impossible to generate multiple molecules with the same combination of cell barcode, assigned gene and UMI sequence. Thus, any molecules with the same combination across multiple samples are likely to arise from barcode swapping.

The swappedDrops function will identify overlapping combinations in the molecule information files of all multiplexed 10X samples sequenced on the same run. It will remove these combinations and return "cleaned" UMI count matrices for all samples to use in downstream analyses. To demonstrate, we mock up a set of molecule information files for three multiplexed 10X samples:

    set.seed(1000)
    mult.mol.info <- DropletUtils:::simSwappedMolInfo(tempfile(), nsamples=3)
    mult.mol.info

    ## [1] "/tmp/Rtmpsd303I/file8dc23681ecd81.1.h5"
    ## [2] "/tmp/Rtmpsd303I/file8dc23681ecd81.2.h5"
    ## [3] "/tmp/Rtmpsd303I/file8dc23681ecd81.3.h5"

We then apply swappedDrops to these files to remove the effect of swapping in our count matrices.

    s.out <- swappedDrops(mult.mol.info, min.frac=0.9)
    length(s.out$cleaned)

    ## [1] 3

    class(s.out$cleaned[[1]])

    ## [1] "dgCMatrix"
    ## attr(,"package")
    ## [1] "Matrix"

For combinations where 90% of the reads belong to a single sample, the molecule is assigned to that sample rather than being removed. This assumes that swapping is relatively rare, so that the read count should be highest in the sample of origin. The exact percentage can be tuned by altering min.frac in the swappedDrops call.
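The filtering rule can be sketched on a toy molecule table in base R. Each row is one molecule in one sample; molecules sharing the same cell barcode, gene and UMI across samples are kept only in a sample holding at least min.frac of their reads. The column names and values are hypothetical, and the sketch mirrors the description above rather than the swappedDrops implementation.

```r
# Sketch of swap removal on a toy molecule table (hypothetical data and columns).
mols <- data.frame(
    sample = c("A", "B", "A", "B", "A", "B"),
    cell   = c("AAAC", "AAAC", "CCGT", "CCGT", "GGTA", "TTAG"),
    gene   = c(1, 1, 2, 2, 3, 3),
    umi    = c(101, 101, 202, 202, 303, 404),
    reads  = c(95, 5, 10, 10, 8, 7),
    stringsAsFactors = FALSE
)

min.frac <- 0.9

# Molecules are matched across samples by cell barcode, gene and UMI.
combo <- paste(mols$cell, mols$gene, mols$umi, sep = "_")

# Fraction of each combination's reads that falls in each sample.
frac <- mols$reads / ave(mols$reads, combo, FUN = sum)

# Unique combinations have a fraction of 1 and are always kept.
keep <- frac >= min.frac
cleaned <- mols[keep, ]
cleaned
```

Here the first combination is retained in sample A (95 of 100 reads), the second is discarded from both samples because its reads are split evenly, and the two unshared molecules are untouched.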

Chimeric reads within cells

On occasion, chimeric molecules are generated during library preparation where incomplete PCR products from one cDNA molecule hybridise to another molecule for extension using shared sequences like the poly-A tail for 3' protocols. This produces an amplicon where the UMI and cell barcode originate from one transcript molecule but the gene sequence is from another, equivalent to swapping of reads between genes. We handle this effect by removing all molecules in the same cell with the same UMI sequence using the chimericDrops() function. This is applied below to a molecule information file to obtain a single cleaned count matrix for the relevant sample.

    out <- chimericDrops(mult.mol.info[1])
    class(out)

    ## [1] "list"

Of course, this may also remove non-chimeric molecules that have the same UMI by chance, but for typical UMI lengths (10-12 bp for 10X protocols) we expect UMI collisions to be very rare between molecules from the same cell. Nevertheless, to mitigate losses due to collisions, we retain any molecule that has a much greater number of reads compared to all other molecules with the same UMI in the same cell.

Session information

    sessionInfo()

    ## R Under development (unstable) (2021-10-19 r81077)
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## Running under: Ubuntu 20.04.3 LTS
    ## 
    ## Matrix products: default
    ## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
    ## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
    ## 
    ## locale:
    ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
    ##  [3] LC_TIME=en_GB              LC_COLLATE=C              
    ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
    ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
    ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    ## 
    ## attached base packages:
    ## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
    ## [8] base     
    ## 
    ## other attached packages:
    ##  [1] Matrix_1.3-4                DropletUtils_1.15.2        
    ##  [3] SingleCellExperiment_1.17.1 SummarizedExperiment_1.25.2
    ##  [5] Biobase_2.55.0              GenomicRanges_1.47.3       
    ##  [7] GenomeInfoDb_1.31.1         IRanges_2.29.0             
    ##  [9] S4Vectors_0.33.2            BiocGenerics_0.41.1        
    ## [11] MatrixGenerics_1.7.0        matrixStats_0.61.0         
    ## [13] knitr_1.36                  BiocStyle_2.23.0           
    ## 
    ## loaded via a namespace (and not attached):
    ##  [1] locfit_1.5-9.4            xfun_0.28                
    ##  [3] bslib_0.3.1               beachmat_2.11.0          
    ##  [5] HDF5Array_1.23.0          lattice_0.20-45          
    ##  [7] rhdf5_2.39.0              htmltools_0.5.2          
    ##  [9] yaml_2.2.1                rlang_0.4.12             
    ## [11] R.oo_1.24.0               jquerylib_0.1.4          
    ## [13] scuttle_1.5.0             R.utils_2.11.0           
    ## [15] BiocParallel_1.29.1       dqrng_0.3.0              
    ## [17] GenomeInfoDbData_1.2.7    stringr_1.4.0            
    ## [19] zlibbioc_1.41.0           R.methodsS3_1.8.1        
    ## [21] evaluate_0.14             fastmap_1.1.0            
    ## [23] parallel_4.2.0            highr_0.9                
    ## [25] Rcpp_1.0.7                edgeR_3.37.0             
    ## [27] BiocManager_1.30.16       limma_3.51.0             
    ## [29] DelayedArray_0.21.1       magick_2.7.3             
    ## [31] jsonlite_1.7.2            XVector_0.35.0           
    ## [33] digest_0.6.28             stringi_1.7.5            
    ## [35] bookdown_0.24             grid_4.2.0               
    ## [37] tools_4.2.0               bitops_1.0-7             
    ## [39] rhdf5filters_1.7.0        magrittr_2.0.1           
    ## [41] sass_0.4.0                RCurl_1.98-1.5           
    ## [43] DelayedMatrixStats_1.17.0 sparseMatrixStats_1.7.0  
    ## [45] rmarkdown_2.11            Rhdf5lib_1.17.0          
    ## [47] R6_2.5.1                  compiler_4.2.0

alexanderyoulthad95.blogspot.com

Source: https://bioconductor.org/packages/devel/bioc/vignettes/DropletUtils/inst/doc/DropletUtils.html

Alexander Youlthad95