Count matrices
The mat_counts
input is required for muscadet analysis and must be provided with CreateMuscomicObject()
.
This input should be a matrix (sparse dgCMatrix
or not) of raw read counts with features as rows and cells as columns (Table 1).
See ?mat_counts
for full documentation.
data("mat_counts_atac_tumor")
kable(mat_counts_atac_tumor[698:700,34:36])
- From Seurat/Signac analysis:
library(Seurat) # Seurat v5
library(Signac)
data("atac_small") # Example Seurat class object from Signac
# dgCMatrix matrix from Assay named "RNA"
mat_counts_RNA <- atac_small[["RNA"]]$counts
# dgCMatrix matrix from ChromatinAssay named "peaks"
mat_counts_ATAC <- atac_small[["peaks"]]$counts
- From ArchR/SummarizedExperiment analysis:
library(SummarizedExperiment)
# RangedSummarizedExperiment class object
se <- readRDS("<SummarizedExperiment_path>")
# dgCMatrix matrix
mat_counts_ATAC <- assay(se)
Allele counts tables
The allele_counts
input is optional for the clustering step (allelic information not used for clustering) but required for the CNA calling step: it can be provided in the initial object creation step with CreateMuscomicObject()
or added to muscadet
objects later on with addAlleleCounts()
.
This input should be in a data frame format containing allelic read counts at specific variant positions across cells (Table 2). The format follows a Variant Calling Format (VCF)-like structure and should include:
-
cell
: unique cell barcode
-
id
: variant identifier (e.g. CHROM_POS_REF_ALT)
-
CHROM
: chromosome
-
POS
: position
-
REF
/ ALT
: reference and alternate alleles
-
RD
/ AD
: counts for reference and alternate alleles
-
DP
: total depth
-
GT
: genotype (optional)
See ?allele_counts
for full documentation.
data("allele_counts_atac_tumor")
kable(head(allele_counts_atac_tumor))
Variant positions
Variant positions can be derived from:
- Matched bulk sequencing to identify individual-specific heterozygous SNPs, or
- Reference panels of common SNPs (e.g. gnomAD, 1000G).
Individual-specific heterozygous positions from bulk data
Positions can be retrieved by running FACETS on matched WGS/WES normal samples.
Use snp-pileup
to extract allele counts from bulk BAM files at known sites (VCF of common SNPs from the NCBI database). The output of snp-pileup
can be filtered on depth and allele frequency to keep only heterozygous positions well covered.
See FACETS snp-pileup documentation.
Count reads from single cells
SCReadCounts is used to get read counts per allele from single cell BAM files.
The program scReadCounts
manages the sequential execution of programs readCounts
and readCountsMatrix
, collects the necessary arguments for successful execution, and avoids unnecessary execution of the expensive readCounts
tool if possible. readCounts
requires three input files: a pooled single cell alignment, a list of genomic positions of interest, and the barcodes file (barcodes.tsv) . Optionally, readCounts
can be user-configured for read filtering and cell-barcode handling, including restriction to barcodes of interest (achieved through specifying barcodes file - barcodes.tsv). readCounts
utilizes the barcode information from the pooled single cell alignments and outputs the variant and reference read counts (nvar and nref, respectively), for each barcode (cell), restricted to those present in the barcodes file, in a tab separated text file. This file is then used as an input for the second program - readCountsMatrix
- which, upon providing an output prefix, generates two outputs: (1) a cell-position matrix with absolute nvar and nref counts, and (2) a cell-position matrix with the expressed VAFRNA. VAFRNA is estimated at a user-defined threshold of minimum required sequencing reads (minR); default minR = 5. readCountsMatrix
is time-efficient and can be re-run multiple times at various minR thresholds. Together, these tools facilitate single-cell level assessment of read counts. From https://horvathlab.github.io/NGS/SCReadCounts/
As the desired output is the tab separated text file with variant and reference read counts for each barcode, the command readCounts
is preferred to avoid unnecessary large matrix outputs from readCountsMatrix
.
readCounts \
-s <vcf_file.vcf> \
-r <bam_file.bam> \
-b None \
-G STARsolo_CB \
-f MPileup \
-o <output_file.tsv> \
-t <threads> \
&> <log_file>
Then, the output of readCounts
(table of variant and reference read counts) can be formatted to fit muscadet
input requirement with process_allele()
(Table 3, Table 4).
# Example data frame of readCounts results
readcounts <- data.frame(
CHROM = c("1", "1", "2"),
POS = c(10101, 20202, 30303),
REF = c("A", "G", "T"),
ALT = c("G", "A", "C"),
ReadGroup = c("cell1", "cell1", "cell1"),
SNVCountForward = c(5, 10, 3),
SNVCountReverse = c(4, 6, 2),
RefCountForward = c(20, 15, 10),
RefCountReverse = c(18, 12, 8)
)
readcounts$SNVCount <- readcounts$SNVCountForward + readcounts$SNVCountReverse
readcounts$RefCount <- readcounts$RefCountForward +readcounts$RefCountReverse
readcounts$GoodReads <- readcounts$SNVCount + readcounts$RefCount
readcounts[, "%BadRead"] <- c(0, 0, 0)
readcounts$VAF <- round(readcounts$SNVCount / readcounts$GoodReads, 4)
Feature coordinates
A table of features coordinates with 4 columns (CHROM
, start
, end
, id
) matching features of count matrices (section Section 1) is required (Table 5).
See ?features
for full documentation.
Peaks
library(Signac)
# Example Seurat class object from Signac
data("atac_small")
# GRanges class object
peaks_coord <- atac_small[["peaks"]]$ranges
peaks_coord <- as.data.frame(peaks_coord)
peaks_coord$id <- paste(peaks_coord$seqnames,
peaks_coord$start,
peaks_coord$end,
sep = "-")
peaks_coord <- peaks_coord[, c("seqnames", "start", "end", "id")]
colnames(peaks_coord) <- c("CHROM", "start", "end", "id")
- From ArchR/SummarizedExperiment:
library(SummarizedExperiment)
# RangedSummarizedExperiment class object
se <- readRDS("<SummarizedExperiment_path>")
# GRanges class object
peaks_coord <- rowRanges(se)
peaks_coord <- as.data.frame(peaks_coord)
peaks_coord$id <- paste(peaks_coord$seqnames, peaks_coord$start, peaks_coord$end, sep = "-")
peaks_coord <- peaks_coord[, c("seqnames", "start", "end", "id")]
colnames(peaks_coord) <- c("CHROM", "start", "end", "id")
Genes
# Use your own annotation
genes_coord <- read.delim( "<genes_gtf_file>")
genes_coord <- genes_coord[, c("seqnames", "start", "end", "gene_name")]
colnames(genes_coord) <- c("CHROM", "start", "end", "id")
# Or get coordinates
library(EnsDb.Hsapiens.v86) # for human
library(AnnotationDbi)
genes <- rownames(mat_counts_RNA)
genes_coord <- genes(EnsDb.Hsapiens.v86, filter = GeneNameFilter(genes))
genes_coord <- as.data.frame(genes_coord)
genes_coord <- genes_coord[genes_coord$seqnames %in% c(1:22, "X", "Y"), ]
genes_coord <- genes_coord[, c("seqnames", "start", "end", "gene_name")]
colnames(genes_coord) <- c("CHROM", "start", "end", "id")
rownames(genes_coord) <- NULL
Match cells and features
Important
Cell names (columns) and features names (rows) must match between assays and features coordinates table ids.
# Make sure row names of mat_counts match features ids
identical(rownames(mat_counts_ATAC), peaks_coord[, "id"])
rownames(mat_counts_ATAC) <- peaks_coord[, "id"]
table(rownames(mat_counts_RNA) %in% genes_coord[, "id"])
# Cell names format must match between different assays
intersect(colnames(mat_counts_ATAC), colnames(mat_counts_RNA))
Bulk log-ratios
A table of log ratios computed from matched bulk sequencing (e.g. Whole Genome Sequencing) can optionally be provided to CreateMuscadetObject()
and will be displayed at the bottom of log ratio heatmaps for validation purposes (Table 6).
Log ratios for bulk sequencing can be obtained through FACETS analysis.
See ?bulk_lrr
for full documentation.