Usage: methscan [OPTIONS] COMMAND [ARGS]...
 __  __      _   _     ____   ____    _
|  \/  | ___| |_| |__ / ___| / ___|  / \   _ __
| |\/| |/ _ \ __| '_ \\___ \| |     / _ \ | '_ \
| |  | |  __/ |_| | | |___) | |___ / ___ \| | | |
|_|  |_|\___|\__|_| |_|____/ \____/_/   \_\_| |_|  v1.0.2
Below is a list of all available commands. To find out what they do
and how to use them, check their help like this:
methscan [command] --help
For documentation and a usage tutorial, go to
https://anders-biostat.github.io/MethSCAn/.
Options:
--version Show the version and exit.
--cite Show publication reference and exit.
--help Show this message and exit.
Commands:
prepare Collect and store sc-methylation data for quick access
filter Filter low-quality cells based on coverage and mean methylation
smooth Smooth the pseudobulk of single cell methylation data
scan Scan the genome to discover variably methylated regions
diff Discover differentially methylated regions between groups of cells
matrix Make a methylation matrix, similar to a count matrix in scRNA-seq
profile Plot mean methylation around a group of genomic features
Usage: methscan prepare [OPTIONS] [INPUT_FILES]... DATA_DIR
Gathers single cell methylation data from multiple input files (one per
cell) and creates a sparse matrix (position x cell) in CSR format for each
chromosome. Methylated sites are represented by 1, unmethylated sites by
-1, and missing values and other bases by 0.
INPUT_FILES are single cell methylation files, for example
'.cov'-files generated by Bismark.
DATA_DIR is the output directory where the methylation data will be
stored.
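
The encoding described above can be illustrated with a minimal Python sketch.
This is not MethSCAn's own code: the names 'call_site' and 'to_csr' are
hypothetical, the read counts are made up, and the sketch assumes that an
"ambiguous" site is one with both methylated and unmethylated reads in the
same cell (see --round-sites under Options).

```python
def call_site(meth, unmeth, round_sites=False):
    """Return 1 (methylated), -1 (unmethylated) or 0 (missing or discarded)."""
    if meth == unmeth:
        # No reads at all, or a tie: never stored.
        return 0
    if meth > 0 and unmeth > 0 and not round_sites:
        # Ambiguous site (assumption: mixed reads), discarded without rounding.
        return 0
    return 1 if meth > unmeth else -1


def to_csr(dense_rows):
    """Convert a dense position x cell table to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:  # zeros (missing values) are not stored
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


# Hypothetical (methylated, unmethylated) read counts: 3 positions x 2 cells.
counts = [
    [(4, 0), (0, 0)],
    [(0, 2), (5, 1)],
    [(3, 3), (0, 1)],
]
dense = [[call_site(m, u, round_sites=True) for m, u in row] for row in counts]
data, indices, indptr = to_csr(dense)
```

Only the non-zero calls end up in 'data', which is why a sparse format suits
single-cell data where most positions are unobserved in most cells.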
Note: If you have many cells and encounter a "too many open files" error,
you need to increase the open file limit, e.g. with 'ulimit -n 99999'.
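
Whether the limit is likely to be hit can be checked before starting a run.
The following Python sketch (Unix only) compares the number of input files
with the current soft limit. The function name and the 'reserve' margin are
hypothetical, and the sketch assumes that up to one file per cell may be open
at the same time, which the help text does not state.

```python
import resource


def needs_higher_limit(n_input_files, reserve=64):
    """Return True if n_input_files plus a safety margin exceeds the soft limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return False  # no limit configured
    return n_input_files + reserve > soft
```

If this returns True for your number of cells, raise the limit with 'ulimit -n'
in the same shell before running 'methscan prepare'.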
Options:
--round-sites Specify that you want to round methylation sites with
ambiguous status to 0% or 100%. This means, for example,
that a CpG site with 5 methylated reads and 1
unmethylated read will be considered methylated in that
cell. Without this option, ambiguous sites are discarded.
Note that sites with the same number of methylated and
unmethylated reads will always be discarded.
--chunksize INTEGER The data of each chromosome is read in chunks [default:
10 Mbp] to reduce memory requirements. If you are
running out of RAM, decrease the chunk size (in bp).
[x>=1]
--input-format TEXT Specify the format of the input files. Options:
'bismark' (default), 'methylpy', 'allc', 'biscuit',
'biscuit_short' or custom (see below).
You can specify a custom format by giving the column layout, the separator,
and whether the file has a header. These six values must be separated by ':'
and enclosed in quotation marks, for example --input-format '1:2:3:4u:\t:1'
The six ':'-separated values denote:
1. the number of the column that contains the chromosome name
2. the number of the column that contains the genomic position
3. the number of the column that contains the methylated counts
4. the number of the column that contains either the total coverage or the
unmethylated counts, followed by 'c' (coverage) or 'u' (unmethylated), e.g.
'4c' to denote that the 4th column contains the coverage
5. the separator, e.g. '\t' or 'TAB' for tsv files or ',' for csv files
6. '1' if the file has a header or '0' if it does not have a header
All column numbers are 1-indexed, i.e. to refer to the first column use '1' and not
'0'.
--help Show this message and exit.
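
To make the custom format string concrete, here is a minimal Python sketch of
how such a six-part specification could be interpreted. It is an illustration
only and not MethSCAn's parser: the function names and dictionary keys are
hypothetical, and a ':' used as the separator itself is not handled.

```python
def parse_custom_format(spec):
    """Split a custom --input-format string into named fields."""
    parts = spec.split(":")
    if len(parts) != 6:
        raise ValueError("expected six ':'-separated values")
    chrom, pos, meth, fourth, sep, header = parts
    if not fourth or fourth[-1] not in "cu":
        raise ValueError("fourth value must end in 'c' or 'u'")
    if sep in ("\\t", "TAB"):  # a literal backslash-t or the word TAB
        sep = "\t"
    if header not in ("0", "1"):
        raise ValueError("header flag must be '0' or '1'")
    return {
        # Convert the 1-indexed column numbers to 0-indexed list positions.
        "chrom_col": int(chrom) - 1,
        "pos_col": int(pos) - 1,
        "meth_col": int(meth) - 1,
        "count_col": int(fourth[:-1]) - 1,
        "count_is_coverage": fourth[-1] == "c",
        "sep": sep,
        "has_header": header == "1",
    }


def read_counts(line, fmt):
    """Return (chromosome, position, methylated, unmethylated) for one data line."""
    fields = line.rstrip("\n").split(fmt["sep"])
    meth = int(fields[fmt["meth_col"]])
    other = int(fields[fmt["count_col"]])
    # With 'c' the column holds total coverage, so subtract the methylated reads.
    unmeth = other - meth if fmt["count_is_coverage"] else other
    return fields[fmt["chrom_col"]], int(fields[fmt["pos_col"]]), meth, unmeth
```

For example, '1:2:3:4c:,:0' describes a headerless csv file whose fourth column
holds the total coverage, so a line 'chr2,250,3,4' yields 3 methylated reads
and 1 unmethylated read.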
Usage: methscan filter [OPTIONS] DATA_DIR FILTERED_DIR
Filters low-quality cells based on the number of observed methylation sites
and/or the global methylation percentage.
Alternatively, you may also provide a text file with the names of the cells
you want to keep.
DATA_DIR is the unfiltered directory containing the methylation
matrices produced by running 'methscan prepare'.
FILTERED_DIR is the output directory storing methylation data only
for the cells that passed all filtering criteria.
Options:
--min-sites INTEGER Minimum number of methylation sites required for a
cell to pass filtering. [x>=1]
--max-sites INTEGER Maximum number of methylation sites allowed for a
cell to pass filtering. [x>=1]
--min-meth PERCENT Minimum average methylation percentage required for a
cell to pass filtering. [0<=x<=100]
--max-meth PERCENT Maximum average methylation percentage allowed for a
cell to pass filtering. [0<=x<=100]
--cell-names FILENAME A text file with the names of the cells you want to
keep (default) or remove. This is an alternative to
the min/max filtering options. Each cell name must be
on a new line.
--keep / --discard Specify whether the cells listed in your text file
should be kept (default) or discarded from the data
set. Only use together with --cell-names.
--help Show this message and exit.
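
The two filtering modes described above can be sketched in Python as follows.
This is an illustration, not MethSCAn's implementation: the function names and
the example statistics are hypothetical, and the sketch assumes that the
min/max thresholds are inclusive.

```python
def passes_filters(n_sites, mean_meth, min_sites=None, max_sites=None,
                   min_meth=None, max_meth=None):
    """Check one cell against the optional min/max thresholds (assumed inclusive)."""
    if min_sites is not None and n_sites < min_sites:
        return False
    if max_sites is not None and n_sites > max_sites:
        return False
    if min_meth is not None and mean_meth < min_meth:
        return False
    if max_meth is not None and mean_meth > max_meth:
        return False
    return True


def select_cells(stats, cell_names=None, keep=True, **thresholds):
    """stats maps cell name -> (observed sites, mean methylation in percent)."""
    selected = []
    for name, (n_sites, mean_meth) in stats.items():
        if cell_names is not None:
            # Name-based mode: keep or discard the listed cells.
            listed = name in cell_names
            ok = listed if keep else not listed
        else:
            ok = passes_filters(n_sites, mean_meth, **thresholds)
        if ok:
            selected.append(name)
    return selected


# Hypothetical per-cell statistics.
stats = {"cell_a": (50000, 75.0), "cell_b": (900, 74.0), "cell_c": (60000, 20.0)}
by_threshold = select_cells(stats, min_sites=1000, min_meth=50)
by_name = select_cells(stats, cell_names={"cell_b"}, keep=False)
```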
Usage: methscan smooth [OPTIONS] DATA_DIR
Calculates the smoothed mean methylation over the whole genome.
DATA_DIR is the directory containing the methylation matrices
produced by running 'methscan prepare'.
The smoothed methylation values will be written to
DATA_DIR/smoothed/.
Options:
-bw, --bandwidth INTEGER Smoothing bandwidth in basepairs. [default: 1000;
x>=1]
--use-weights Use this to weigh each methylation site by
log1p(coverage). [default: off]
--help Show this message and exit.
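
As a rough illustration of what smoothing with a bandwidth and optional
log1p(coverage) weights can look like, here is a minimal Python sketch of a
kernel-weighted average. The help text does not name the kernel, so the
tricube kernel used here is an assumption, as is treating the bandwidth as the
full width of the kernel. The function names are hypothetical.

```python
import math


def tricube(u):
    """Tricube kernel: largest at u = 0, zero for |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def smooth(positions, meth_frac, coverage, bandwidth=1000, use_weights=False):
    """Return {position: smoothed methylation} for pseudobulk values in [0, 1]."""
    half = bandwidth / 2  # assumption: bandwidth is the full kernel width
    smoothed = {}
    for x in positions:
        num = den = 0.0
        for p, m, c in zip(positions, meth_frac, coverage):
            k = tricube((p - x) / half)
            if k == 0.0:
                continue  # site lies outside the kernel
            w = k * (math.log1p(c) if use_weights else 1.0)
            num += w * m
            den += w
        smoothed[x] = num / den
    return smoothed
```

The quadratic loop keeps the sketch short. It expects every listed site to
have a coverage of at least 1, so that the weights never sum to zero.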
Usage: methscan scan [OPTIONS] DATA_DIR OUTPUT
Scans the whole genome for variably methylated regions (VMRs). This works by
sliding a window across the genome, calculating the variance of methylation
per window, and selecting windows above a variance threshold.
DATA_DIR is the directory containing the methylation matrices
produced by running 'methscan prepare', as well as the smoothed methylation
values produced by running 'methscan smooth'.
OUTPUT is the path of the output file in '.bed' format, containing
the VMRs that were found.
Options:
-bw, --bandwidth INTEGER Bandwidth of the sliding window in basepairs.
Increase this value to find larger VMRs.
[default: 2000; x>=1]
--stepsize INTEGER Step size of the sliding window in basepairs.
Should be smaller than the bandwidth. Increase
this value to gain speed, at the cost of some
accuracy. [default: 100; x>=1]
--var-threshold FLOAT The variance threshold, e.g. 0.02 means that the
top 2% most variable genomic bins will be merged
and reported as VMRs. [default: 0.02; 0<=x<=1]
--min-cells INTEGER The minimum number of cells required to report a
VMR. For example, a value of 6 means that only
VMRs with sequencing coverage in at least 6 cells
are reported. [default: 6; x>=1]
--bridge-gaps INTEGER Merge neighboring VMRs if they are within this
distance in basepairs. Useful to prevent
fragmented VMRs separated only by small gaps.
[default: off] [x>=0]
--threads INTEGER How many CPU threads to use in parallel.
[default: all available]
--write-header Write the column names of the output file.
[default: off]
--help Show this message and exit.
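
The thresholding and merging steps described above can be sketched in Python.
This is a simplified illustration, not MethSCAn's implementation: it starts
from already computed per-window variances, ignores --min-cells, and assumes
that each window covers [start, start + bandwidth). The function name is
hypothetical.

```python
def find_vmrs(window_starts, variances, bandwidth=2000, var_threshold=0.02,
              bridge_gaps=0):
    """Merge the most variable windows into (start, end) regions."""
    if not variances:
        return []
    # Number of windows in the top fraction, at least one.
    n_top = max(1, int(len(variances) * var_threshold))
    cutoff = sorted(variances, reverse=True)[n_top - 1]
    regions = []
    for start, var in sorted(zip(window_starts, variances)):
        if var < cutoff:
            continue
        end = start + bandwidth
        if regions and start <= regions[-1][1] + bridge_gaps:
            # Overlaps, touches, or lies within bridge_gaps of the previous region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


# Hypothetical variances for ten windows with a step size of 100 bp.
starts = list(range(0, 1000, 100))
variances = [0.01, 0.02, 0.30, 0.28, 0.02, 0.01, 0.01, 0.25, 0.22, 0.20]
vmrs = find_vmrs(starts, variances, bandwidth=200, var_threshold=0.5)
bridged = find_vmrs(starts, variances, bandwidth=200, var_threshold=0.5,
                    bridge_gaps=200)
```

In this toy example the two regions are 200 bp apart, so a --bridge-gaps value
of 200 joins them into a single region.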
Usage: methscan diff [OPTIONS] DATA_DIR CELL_GROUPS OUTPUT
Scans the whole genome for differentially methylated regions (DMRs) between
two groups of cells. This works by sliding a window across the genome,
performing a t-test for each window, and merging windows above a threshold.
To control the false discovery rate, the same procedure is repeated on
permutations of the data which are then used to calculate an adjusted
p-value for each DMR.
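
One common way to turn permutation results into an adjusted p-value is the
empirical fraction of permutation-derived regions that score at least as
extreme as the observed one. The Python sketch below shows that estimator.
Whether MethSCAn uses exactly this formula is not stated in the help text, and
the function name is hypothetical.

```python
def empirical_adjusted_p(observed_t, permuted_ts):
    """Empirical p-value with the usual +1 correction to avoid p = 0."""
    n_extreme = sum(1 for t in permuted_ts if abs(t) >= abs(observed_t))
    return (n_extreme + 1) / (len(permuted_ts) + 1)
```

With this estimator the smallest reachable p-value is 1 / (number of
permutation scores + 1), so more permutations allow finer resolution.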
DATA_DIR is the directory containing the methylation matrices
produced by running 'methscan prepare', as well as the smoothed methylation
values produced by running 'methscan smooth'.
CELL_GROUPS is a comma-separated text file that lists the group
membership (e.g. cell type or treatment) of each cell. Each row contains two
comma-separated values: the cell name and its group label. Cell names are
listed in DATA_DIR/column_header.txt. Only two cell groups can be
compared, so there should be exactly two unique group labels (e.g. "neuron"
and "glia" or "wildtype" and "KO"). To exclude cells that do not belong to
either group from the analysis, you can assign them the group label '-'
(dash character).
OUTPUT is the path of the output file in '.bed' format, containing
the DMR genome coordinates, their t-statistic, the cell group in which the
DMR has lower methylation, and the adjusted p-value.
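
The CELL_GROUPS format can be checked before a run with a short Python
sketch like the one below. The function name and the example cell names are
hypothetical, and the sketch only covers the rules stated above: two values
per row, the '-' label for excluded cells, and exactly two remaining groups.

```python
import csv
import io


def read_cell_groups(handle):
    """Return {group label: [cell names]}, skipping cells labelled '-'."""
    groups = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip empty lines
        name, label = row[0].strip(), row[1].strip()
        if label == "-":
            continue  # cell excluded from the comparison
        groups.setdefault(label, []).append(name)
    if len(groups) != 2:
        raise ValueError(f"expected exactly 2 group labels, found {len(groups)}")
    return groups


example = io.StringIO("cell_01,neuron\ncell_02,glia\ncell_03,-\ncell_04,neuron\n")
groups = read_cell_groups(example)
```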
Options:
-bw, --bandwidth INTEGER Bandwidth of the sliding window in basepairs.
Increase this value to find larger DMRs.
[default: 2000; x>=1]
--stepsize INTEGER Step size of the sliding window in basepairs.
Increase this value to gain speed, at the cost of
some accuracy. [default: 1000; x>=1]
--threshold FLOAT The t-statistic threshold, e.g. 0.02 means that
the top 2% and bottom 2% most differentially
methylated genomic bins will be separately merged
and reported as DMRs with adjusted p-values.
[default: 0.02; 0<=x<=1]
--min-cells INTEGER The minimum number of cells required to consider a
genomic region for testing. For example, a value
of 6 means that only regions with sequencing
coverage in at least 6 cells per group are
considered. [default: 6; x>=1]
--bridge-gaps INTEGER Merge neighboring DMRs if they are within this
distance in basepairs. Useful to prevent
fragmented DMRs separated only by small gaps.
[default: off] [x>=0]
--threads INTEGER How many CPU threads to use in parallel.
[default: all available]
--write-header Write the column names of the output file.
[default: off]
--debug Use this to also report DMRs that were identified
in permutations. [default: off]
--help Show this message and exit.
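
The per-window test combined with the --min-cells requirement can be
illustrated with the following Python sketch. The help text only says that a
t-test is performed. The unequal-variance (Welch) form used here is an
assumption, and the function name and input values are hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b, min_cells=6):
    """Welch t-statistic for one window, or None if the window is not testable.

    group_a and group_b hold one methylation value per covered cell.
    """
    min_cells = max(min_cells, 2)  # a variance needs at least two values
    if len(group_a) < min_cells or len(group_b) < min_cells:
        return None  # too few cells with coverage in at least one group
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    if se == 0:
        return None  # no variation at all, statistic undefined
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se
```

A positive value means that the first group has higher mean methylation in
the window, a negative value that the second group does.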
Usage: methscan matrix [OPTIONS] REGIONS DATA_DIR OUTPUT_DIR
From single cell methylation or NOMe-seq data, calculates the average
methylation in genomic regions for every cell. The output is a long table
that can be used e.g. for dimensionality reduction or clustering, analogous
to a count matrix in scRNA-seq.
REGIONS is a .bed file of regions for which methylation will be
quantified in every cell.
DATA_DIR is the directory containing the methylation matrices
produced by running 'methscan prepare', as well as the smoothed methylation
values produced by running 'methscan smooth'.
OUTPUT_DIR is the output directory.
It will contain four cell × region matrices ("count tables"):
'methylated_sites.csv.gz': the number of sites that were methylated
'total_sites.csv.gz': the total number of observed sites (sites with read coverage)
'methylation_fractions.csv.gz': the average methylation, calculated as:
# of methylated sites / # of total observed sites
'mean_shrunken_residuals.csv.gz': the mean shrunken residuals, a more accurate
measure of methylation in a genomic region.
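
For a single cell and a single region, the four quantities can be sketched in
Python as shown below. The help text does not define the shrunken residuals,
so this sketch assumes that each residual is the cell's methylation call (1 or
0) minus the smoothed pseudobulk methylation at that site, and that the mean
is shrunk toward zero with a pseudocount of 1. The function name is
hypothetical.

```python
def region_summary(calls, smoothed):
    """Summarize one cell in one region.

    calls:    1 (methylated) or -1 (unmethylated) for each observed site
    smoothed: smoothed pseudobulk methylation (0..1) at the same sites
    """
    n_total = len(calls)
    if n_total == 0:
        return None  # no observed sites: the value is NA, not zero
    n_meth = sum(1 for c in calls if c == 1)
    residuals = [(1.0 if c == 1 else 0.0) - s for c, s in zip(calls, smoothed)]
    return {
        "methylated_sites": n_meth,
        "total_sites": n_total,
        "methylation_fraction": n_meth / n_total,
        # Assumption: pseudocount of 1 in the denominator shrinks toward zero.
        "mean_shrunken_residual": sum(residuals) / (n_total + 1),
    }
```

Under these assumptions, a cell with few observed sites gets a residual close
to zero, which reduces the influence of poorly covered regions.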
Options:
--sparse [experimental] Write the output as a sparse matrix,
instead of the four .csv.gz files described above. This
is faster and more space-efficient for huge data sets.
The output 'matrix.mtx.gz' contains five columns:
row_index, col_index, shrunken residuals, methylation
fractions, coverage. Both indices are 1-indexed. Missing
entries denote NA, not zero (!) [default: off]
--threads INTEGER How many CPU threads to use in parallel. [default: all
available]
--help Show this message and exit.
Usage: methscan profile [OPTIONS] REGIONS DATA_DIR OUTPUT
From single cell methylation or NOMe-seq data, calculates the average
methylation profile of a set of genomic regions. Useful for plotting and
visually comparing methylation between groups of regions or cells.
REGIONS is an alphabetically sorted (!) .bed file of regions for
which the methylation profile will be produced.
DATA_DIR is the directory containing the methylation matrices
produced by running 'methscan prepare'.
OUTPUT is the file path where the methylation profile data will be
written. Should end with '.csv'.
Options:
--width INTEGER The total width of the profile plot in bp. The
profile spans half of this amount in both
directions from the center of each bed region.
Shorter regions will be extended, longer regions
will be shortened accordingly. [default: 4000; x>=1]
--strand-column INTEGER The bed column number (1-indexed) denoting the DNA
strand of the region. [optional] [x>=1]
--label TEXT Specify a constant value to be added as a column to
the output table. This can be useful to give each
output a unique label when you want to concatenate
multiple outputs. [optional]
--help Show this message and exit.
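
The effect of --width can be illustrated with a small Python sketch. It is not
MethSCAn's code: the function names are hypothetical, the region center is
taken as the integer midpoint, and mirroring positions for regions on the
minus strand is an assumption about how --strand-column is used.

```python
def profile_window(start, end, width=4000):
    """Return the (start, end) of the profile window centered on a bed region."""
    center = (start + end) // 2
    half = width // 2
    return center - half, center + half


def relative_position(site, start, end, strand="+"):
    """Position of a site relative to the region center, mirrored on '-'."""
    offset = site - (start + end) // 2
    return -offset if strand == "-" else offset
```

A 200 bp region at 10000-10200 is extended to 8100-12100, while a 10 kb region
at 10000-20000 is shortened to 13000-17000.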