AMULET: ATAC-seq MULtiplet Estimation Tool
A count based method for detecting doublets from single nucleus ATAC-seq (snATAC-seq) data.
Publication: https://doi.org/10.1186/s13059-021-02469-x
Source Code: https://github.com/UcarLab/AMULET
Note: The jar files can be found in the releases: https://github.com/UcarLab/AMULET/releases
AMULET is provided under the GPLv3 license.
Python 3 with the following libraries: - numpy - pandas - scipy - statsmodels
pip3 install numpy pandas scipy statsmodels
If using BAM files: Java SE 8 (build 231) or higher.
If the system requirements are satisfied:
chmod (e.g., chmod +x AMULET.sh)The shell file can be run from there. Java is portable and the jar should not need to be recompiled for most platforms.
Running AMULET consists of two parts:
1) Identifying all loci with >2 uniquely aligning reads for each cell.
2) detecting doublets cells that have more loci with >2 reads than expected.
The bash shell script combines both of these steps, but they can be run independently as well. The latter (doublet detection) may be useful for running doublet detection using q-values different from the default.
To run using the bash shell script, provide the following arguments: 1. Path to the bam file (e.g., possorted.bam from CellRanger) 2. Path to a barcode to cell_id map in CSV format (e.g., singlecell.csv from CellRanger) 3. Path to the list of chromosomes to use for the analysis (e.g., human_chromosomes_noxy.txt) 4. Path to known repetitive elements. (e.g., blacklist_repeats_segdups_rmsk_hg19.bed depending on genome) 5. Path to an existing output directory to write output files. 6. Path of this script’s directory.
Fragment file example:
AMULET.sh /path/to/fragments.tsv.gz /path/to/singlecell.csv /path/to/human_autosomes.txt /path/to/repeatfilter.bed /path/to/output/ /path/to/shellscript/
BAM file example:
AMULET.sh /path/to/possorted.bam /path/to/singlecell.csv /path/to/human_autosomes.txt /path/to/repeatfilter.bed /path/to/output/ /path/to/shellscript/
Note: If you know the input bam file is coordinate sorted and you get an error message saying it’s not, please use the --forcesorted option. There is a problem with the SAMReader library not recognizing this flag correctly in the header.
Example:
AMULET.sh --forcesorted /path/to/possorted.bam /path/to/singlecell.csv /path/to/human_autosomes.txt /path/to/repeatfilter.bed /path/to/output/ /path/to/shellscript/
The above can be used for common CellRanger outputs. Use the following options to adjust for differences in CellRanger outputs or for different inputs in general:
--expectedoverlap Expected number of reads overlapping. (Default: 2)
--maxinsertsize The maximum insert size (in bp) between read pairs. (Default: 900)
--startbases The amount of bases add to the start position. (BAM Default: 4, Frag Default: 0)
--endbases The amount of bases to add to the end position. (BAM Default: -5, Frag Default: 0)
BAM Input Only:
--bambc The bam file attribute to use to extract the barcode from reads (Default: “CB”)
--forcesorted Forces the input bam file to be treated as sorted.
--bcidx The column index (counting from 0) of the CSV file barcode to cell id map for the barcode. (Default: 0)
--cellidx The column index (counting from 0) of the CSV file barcode to cell id map for the cell id. (Default: 0)
--iscellidx The column index (counting from 0) of the CSV file barcode to cell id map for determining if the barcode corresponds to a cell. (Default: 9)
--mapqthresh Threshold for filtering low map quality reads (<= comparison). (Default: 30)
--maxinsertsize The maximum insert size (in bp) between read pairs. (Default: 900)
Examples:
AMULET.sh --bambc CB --bcidx 1 /path/to/possorted.bam /path/to/singlecell.csv /path/to/human_autosomes.txt /path/to/repeatfilter.bed /path/to/output/ /path/to/shellscript/
AMULET.sh --bcidx 1 --cellidx 2 --iscellidx 3 /path/to/possorted.bam /path/to/singlecell.csv /path/to/human_autosomes.txt /path/to/repeatfilter.bed /path/to/output/ /path/to/shellscript/
To run the fragment overlap counter, use python3 FragmentFileOverlapCounter.py and provide the following input arguments:
FRAGMENTFILE Path to the fragment file (e.g., fragments.tsv.gz from CellRanger)
BCMAP Path to a barcode to cell_id map in CSV format (e.g., singlecell.csv from CellRanger)
CHRLIST Path to the list of chromosomes to use for the analysis (e.g., human_autosomes.txt)
OUTDIR Path to an existing output directory to write output files.
Example:
python3 FragmentFileOverlapCounter.py BAMFILE BCMAP CHRLIST OUTDIR
Optional Arguments:
--expectedoverlap Expected number of reads overlapping. (Default: 2)
--maxinsertsize The maximum insert size (in bp) between read pairs. (Default: 900)
--startbases The amount of bases add to the start position. (Default: 0)
--endbases The amount of bases to add to the end position (can be negative). (Default: 0)
To run the overlap counter jar file, use java -jar snATACOverlapCounter.jar and provide the following input arguments:
BAMFILE Path to the bam file (e.g., possorted.bam from CellRanger)
BCMAP Path to a barcode to cell_id map in CSV format (e.g., singlecell.csv from CellRanger)
CHRLIST Path to the list of chromosomes to use for the analysis (e.g., human_autosomes.txt)
OUTDIR Path to an existing output directory to write output files.
Example:
java -jar snATACOverlapCounter.jar BAMFILE BCMAP CHRLIST OUTDIR
Optional Arguments:
--expectedoverlap Expected number of reads overlapping. (Default: 2)
--bambc The bam file attribute to use to extract the barcode from reads (Default: “CB”)
--forcesorted Forces the input bam file to be treated as sorted.
--bcidx The column index (counting from 0) of the CSV file barcode to cell id map for the barcode. (Default: 0)
--cellidx The column index (counting from 0) of the CSV file barcode to cell id map for the cell id. (Default: 8)
--iscellidx The column index (counting from 0) of the CSV file barcode to cell id map for determining if the barcode corresponds to a cell. (Default: 9)
--mapqthresh Threshold for filtering low map quality reads (<= comparison). (Default: 30)
--maxinsertsize The maximum insert size (in bp) between read pairs. (Default: 900)
--startbases The amount of bases add to the start position. (Default: 4)
--endbases The amount of bases to add to the end position (can be negative). (Default: -5)
Examples:
java -jar snATACOverlapCounter.jar --bambc CB BAMFILE BCMAP CHRLIST OUTDIR
java -jar snATACOverlapCounter.jar --bambc CB --cellidx 1 BAMFILE BCMAP CHRLIST OUTDIR
The doublet detection python file requires 3 input arguments.
OVERLAPS Path to the Overlaps.txt output from overlap counter.
OVERLAPSUMMARY Path to the OverlapSummary.txt output from overlap counter.
OUTDIR Path to an existing output directory to write output files.
Optional Arguments:
--rfilter A bed file of known repetitive regions to exclude from doublet detection.
--q The q-value threshold for detecting doublets.
--qrep The q-value threshold for inferring repetitive regions to exclude from doublet detection.
Example:
python3 AMULET.py --rfilter /path/to/repeats.bed OVERLAPS OVERLAPSUMMARY OUTDIR
The overlap counter script produces four output files describing the run and the data: Overlaps.txt, OverlapSummary.txt, StatSummary.txt, and RunTime.txt
A tab delimited file with the following columns:
A tab delimited file with the following columns:
A file containing the following information:
Shows the runtime (in seconds) for identifying all instances of read overlaps >2.
The multiplet detection python script produces three output files: MultipletProbabilities, MultipletCellIds_xx.txt, and MultipletBarcodes_xx.txt (xx corresponding to the q-value threshold used to call multiplets).
A tab delimited file with the following columns:
Files with the MultipletCellIds prefix correspond to multiplet cell ids with a q-value threshold specified by xx (i.e., 0.xx). For example 01 implies a q-value threshold of 0.01.
Files with the MultipletBarcodes prefix correspond to multiplet cell barcodes with a q-value threshold specified by xx (i.e., 0.xx). For example 01 implies a q-value threshold of 0.01.