MethSCAn

List of available commands

Usage: methscan [OPTIONS] COMMAND [ARGS]...

   __  __      _   _     ____   ____    _  
  |  \/  | ___| |_| |__ / ___| / ___|  / \   _ __
  | |\/| |/ _ \ __| '_ \\___ \| |     / _ \ | '_ \  
  | |  | |  __/ |_| | | |___) | |___ / ___ \| | | |
  |_|  |_|\___|\__|_| |_|____/ \____/_/   \_\_| |_| v1.0.2

    Below you find a list of all available commands. To find out what they do
    and how to use them, check their help like this:

    methscan [command] --help

    For documentation and a usage tutorial, go to 
    https://anders-biostat.github.io/MethSCAn/.

Options:
  --version  Show the version and exit.
  --cite     Show publication reference and exit.
  --help     Show this message and exit.

Commands:
  prepare  Collect and store sc-methylation data for quick access
  filter   Filter low-quality cells based on coverage and mean methylation
  smooth   Smooth the pseudobulk of single cell methylation data
  scan     Scan the genome to discover variably methylated regions
  diff     Discover differentially methylated regions between groups of cells
  matrix   Make a methylation matrix, similar to a count matrix in scRNA-seq
  profile  Plot mean methylation around a group of genomic features

prepare

Usage: methscan prepare [OPTIONS] [INPUT_FILES]... DATA_DIR

  Gathers single cell methylation data from multiple input files (one per
  cell) and creates a sparse matrix (position x cell) in CSR format for each
  chromosome. Methylated sites are represented by a 1, unmethylated sites are
  -1, missing values and other bases are 0.
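
  The encoding above, combined with the --round-sites rule described under
  Options, can be sketched as a small function. This is only an illustration
  of the documented behaviour, not MethSCAn's own code; the name encode_site
  and its arguments are hypothetical.

```python
def encode_site(n_meth, n_unmeth, round_sites=False):
    """Return 1 (methylated), -1 (unmethylated) or 0 (missing/discarded).

    Hypothetical helper that mirrors the rules in the help text.
    """
    if n_meth == 0 and n_unmeth == 0:
        return 0  # no reads: missing value
    if n_unmeth == 0:
        return 1  # only methylated reads
    if n_meth == 0:
        return -1  # only unmethylated reads
    # Ambiguous site: both methylated and unmethylated reads were observed.
    if not round_sites or n_meth == n_unmeth:
        return 0  # discarded (ties are always discarded)
    return 1 if n_meth > n_unmeth else -1
```

  With rounding enabled, the example from the option text (5 methylated
  reads, 1 unmethylated read) yields 1; without rounding it yields 0.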

  INPUT_FILES are single cell methylation files, for example
  '.cov'-files generated by Bismark.

  DATA_DIR is the output directory where the methylation data will be
  stored.

  Note: If you have many cells and encounter a "too many open files" error,
  you need to increase the open file limit, e.g. with 'ulimit -n 99999'.

Options:
  --round-sites        Specify that you want to round methylation sites with
                       ambiguous status to 0% or 100%. This means, for example,
                       that a CpG site with 5 methylated reads and 1
                       unmethylated read will be considered methylated in that
                       cell. Otherwise, ambiguous sites will be discarded.
                       Note that sites with the same number of methylated and
                       unmethylated reads will always be discarded.
  --chunksize INTEGER  The data of each chromosome is read in chunks [default:
                       10 Mbp] to reduce memory requirements. If you are
                       running out of RAM, decrease the chunk size (in bp).
                       [x>=1]
  --input-format TEXT  Specify the format of the input files. Options:
                       'bismark' (default), 'methylpy', 'allc', 'biscuit',
                       'biscuit_short' or custom (see below).
                       
                       You can specify a custom format by specifying the separator, whether the
                       file has a header, and which information is stored in which columns. These
                       values should be separated by ':' and enclosed by quotation marks, for
                       example --input-format '1:2:3:4u:\t:1'
                       
                       The six ':'-separated values denote
                       1. the column containing the chromosome name
                       2. the column containing the genomic position
                       3. the column containing the methylated counts
                       4. the column containing either the total coverage or the unmethylated
                       counts, followed by 'c' (coverage) or 'u' (unmethylated), e.g. '4c' to
                       denote that the 4th column contains the coverage
                       5. the separator, e.g. '\t' or 'TAB' for tsv files or ',' for csv
                       6. '1' if the file has a header or '0' if it does not have a header
                       All column numbers are 1-indexed, i.e. to define the first column use '1' and not
                       '0'.
  --help               Show this message and exit.
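
To make the six-part custom format string more concrete, here is a minimal
sketch of how such a string could be taken apart. The function
parse_custom_format and the dictionary keys are hypothetical and only
illustrate the layout described above; MethSCAn's actual parser may differ.

```python
def parse_custom_format(spec):
    """Split a custom --input-format string into its six parts (illustrative only)."""
    fields = spec.split(":")
    if len(fields) != 6:
        raise ValueError("expected six ':'-separated values")
    chrom, pos, meth, fourth, sep, header = fields
    if len(fourth) < 2 or fourth[-1] not in ("c", "u"):
        raise ValueError("4th value must be a column number followed by 'c' or 'u'")
    if header not in ("0", "1"):
        raise ValueError("6th value must be '0' or '1'")
    if sep in ("\\t", "TAB"):
        sep = "\t"  # both spellings denote a tab character
    return {
        # Column numbers are 1-indexed in the format string, 0-indexed here.
        "chrom_col": int(chrom) - 1,
        "pos_col": int(pos) - 1,
        "meth_col": int(meth) - 1,
        "fourth_col": int(fourth[:-1]) - 1,
        "fourth_is_coverage": fourth[-1] == "c",
        "separator": sep,
        "has_header": header == "1",
    }


# The example string from the option description.
parsed = parse_custom_format(r"1:2:3:4u:\t:1")
```

Here the fourth column holds unmethylated counts, the file is tab-separated,
and the first line is a header.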

filter

Usage: methscan filter [OPTIONS] DATA_DIR FILTERED_DIR

  Filters low-quality cells based on the number of observed methylation sites
  and/or the global methylation percentage.

  Alternatively, you may also provide a text file with the names of the cells
  you want to keep.

  DATA_DIR is the unfiltered directory containing the methylation
  matrices produced by running 'methscan prepare'.

  FILTERED_DIR is the output directory storing methylation data only
  for the cells that passed all filtering criteria.

Options:
  --min-sites INTEGER    Minimum number of methylation sites required for a
                         cell to pass filtering.  [x>=1]
  --max-sites INTEGER    Maximum number of methylation sites allowed for a
                         cell to pass filtering.  [x>=1]
  --min-meth PERCENT     Minimum average methylation percentage required for a
                         cell to pass filtering.  [0<=x<=100]
  --max-meth PERCENT     Maximum average methylation percentage allowed for a
                         cell to pass filtering.  [0<=x<=100]
  --cell-names FILENAME  A text file with the names of the cells you want to
                         keep (default) or remove. This is an alternative to
                         the min/max filtering options. Each cell name must be
                         on a new line.
  --keep / --discard     Specify whether the cells listed in your text file
                         should be kept (default) or discarded from the data
                         set. Only use together with --cell-names.
  --help                 Show this message and exit.
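
The two filtering modes can be sketched as follows. This is a hedged
illustration rather than MethSCAn's implementation: the function names are
hypothetical, and treating the thresholds as inclusive is an assumption.

```python
def passes_filter(n_sites, meth_percent, min_sites=None, max_sites=None,
                  min_meth=None, max_meth=None):
    """Return True if a cell satisfies every threshold that was given.

    Assumption: thresholds are inclusive; unset thresholds are ignored.
    """
    if min_sites is not None and n_sites < min_sites:
        return False
    if max_sites is not None and n_sites > max_sites:
        return False
    if min_meth is not None and meth_percent < min_meth:
        return False
    if max_meth is not None and meth_percent > max_meth:
        return False
    return True


def select_by_name(all_cells, listed_cells, keep=True):
    """Mimic --cell-names with --keep (keep=True) or --discard (keep=False)."""
    listed = set(listed_cells)
    return [cell for cell in all_cells if (cell in listed) == keep]
```

For example, a cell with 40000 observed sites fails a --min-sites value of
50000 regardless of its mean methylation.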

smooth

Usage: methscan smooth [OPTIONS] DATA_DIR

  Calculates the smoothed mean methylation (pseudobulk) over the whole
  genome.

  DATA_DIR is the directory containing the methylation matrices
  produced by running 'methscan prepare'.

  The smoothed methylation values will be written to
  DATA_DIR/smoothed/.

Options:
  -bw, --bandwidth INTEGER  Smoothing bandwidth in basepairs.  [default: 1000;
                            x>=1]
  --use-weights             Use this to weight each methylation site by
                            log1p(coverage).  [default: off]
  --help                    Show this message and exit.
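
The effect of --bandwidth and --use-weights can be illustrated with a
simplified sketch. It uses a plain (uniform) window and treats the bandwidth
as the full window width; both are assumptions made for brevity, since the
kernel MethSCAn applies is not stated here. The name smooth_at is
hypothetical.

```python
import math


def smooth_at(position, sites, bandwidth=1000, use_weights=False):
    """Weighted mean methylation of all sites within the window around position.

    sites: list of (genomic_position, mean_methylation, coverage) tuples.
    Returns None if no site falls into the window.
    """
    half = bandwidth / 2
    weighted_sum = 0.0
    weight_total = 0.0
    for pos, meth, coverage in sites:
        if abs(pos - position) <= half:
            # With --use-weights, each site counts with log1p(coverage).
            weight = math.log1p(coverage) if use_weights else 1.0
            weighted_sum += weight * meth
            weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None
```

With weights enabled, sites with higher coverage pull the smoothed value
towards their own methylation level.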

scan

Usage: methscan scan [OPTIONS] DATA_DIR OUTPUT

  Scans the whole genome for variably methylated regions (VMRs). This works by
  sliding a window across the genome, calculating the variance of methylation
  per window, and selecting windows above a variance threshold.

  DATA_DIR is the directory containing the methylation matrices
  produced by running 'methscan prepare', as well as the smoothed methylation
  values produced by running 'methscan smooth'.

  OUTPUT is the path of the output file in '.bed' format, containing
  the VMRs that were found.

Options:
  -bw, --bandwidth INTEGER  Bandwidth of the sliding window in basepairs.
                            Increase this value to find larger VMRs.
                            [default: 2000; x>=1]
  --stepsize INTEGER        Step size of the sliding window in basepairs.
                            Should be smaller than the bandwidth. Increase
                            this value to gain speed, at the cost of some
                            accuracy.  [default: 100; x>=1]
  --var-threshold FLOAT     The variance threshold, e.g. 0.02 means that the
                            top 2% most variable genomic bins will be merged
                            and reported as VMRs.  [default: 0.02; 0<=x<=1]
  --min-cells INTEGER       The minimum number of cells required to report a
                            VMR. For example, a value of 6 means that only
                            VMRs with sequencing coverage in at least 6 cells
                            are reported.  [default: 6; x>=1]
  --bridge-gaps INTEGER     Merge neighboring VMRs if they are within this
                            distance in basepairs. Useful to prevent
                            fragmented VMRs separated only by small gaps.
                            [default: off]  [x>=0]
  --threads INTEGER         How many CPU threads to use in parallel.
                            [default: all available]
  --write-header            Write the column names of the output file.
                            [default: off]
  --help                    Show this message and exit.
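
The selection and merging step can be sketched as below, assuming the
per-window variances have already been computed. The function call_regions
is hypothetical, and the exact cutoff and tie handling are assumptions; the
sketch only illustrates how --var-threshold and --bridge-gaps interact.

```python
def call_regions(windows, var_threshold=0.02, bridge_gaps=0):
    """Select the most variable windows and merge them into regions.

    windows: list of (start, end, variance) tuples, sorted by start.
    Returns a list of merged (start, end) regions.
    """
    n_top = int(len(windows) * var_threshold)
    if n_top == 0:
        return []
    # Variance of the n_top-th most variable window serves as the cutoff.
    cutoff = sorted((w[2] for w in windows), reverse=True)[n_top - 1]
    regions = []
    for start, end, variance in windows:
        if variance < cutoff:
            continue
        # Merge overlapping or adjacent windows, and windows separated by
        # at most bridge_gaps basepairs.
        if regions and start - regions[-1][1] <= bridge_gaps:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

A larger --bridge-gaps value therefore yields fewer, longer regions.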

diff

Usage: methscan diff [OPTIONS] DATA_DIR CELL_GROUPS OUTPUT

  Scans the whole genome for differentially methylated regions (DMRs) between
  two groups of cells. This works by sliding a window across the genome,
  performing a t-test for each window, and merging windows above a threshold.
  To control the false discovery rate, the same procedure is repeated on
  permutations of the data which are then used to calculate an adjusted
  p-value for each DMR.

  DATA_DIR is the directory containing the methylation matrices
  produced by running 'methscan prepare', as well as the smoothed methylation
  values produced by running 'methscan smooth'.

  CELL_GROUPS is a comma-separated text file that lists the group
  membership (e.g. cell type or treatment) of each cell. Each row contains two
  comma-separated values: The cell name and its group label. Cell names are
  denoted in DATA_DIR/column_header.txt. Only two cell groups can be
  compared, so there should only be two unique group labels (e.g. "neuron" and
  "glia" or "wildtype" and "KO"). To exclude cells that do not belong to
  either group from the analysis, you can assign them the group label '-'
  (dash character).
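
  As an illustration of the CELL_GROUPS layout, the following sketch reads
  such a file and drops cells labelled '-'. The cell names are made up, the
  function read_cell_groups is hypothetical, and the absence of a header
  line is an assumption.

```python
import csv
import io


def read_cell_groups(handle):
    """Read 'cell name,group label' rows and skip cells labelled '-'."""
    groups = {}
    for row in csv.reader(handle):
        if not row:
            continue  # ignore empty lines
        cell, label = row[0].strip(), row[1].strip()
        if label == "-":
            continue  # excluded from the comparison
        groups[cell] = label
    if len(set(groups.values())) != 2:
        raise ValueError("expected exactly two unique group labels")
    return groups


# Made-up example content of a CELL_GROUPS file.
example = io.StringIO("cell_01,neuron\ncell_02,glia\ncell_03,-\ncell_04,neuron\n")
cell_groups = read_cell_groups(example)
```

  After reading, cell_03 is absent from the mapping because it was assigned
  the '-' label.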

  OUTPUT is the path of the output file in '.bed' format, containing
  the DMR genome coordinates, their t-statistic, the cell group in which the
  DMR has lower methylation, and the adjusted p-value.

Options:
  -bw, --bandwidth INTEGER  Bandwidth of the sliding window in basepairs.
                            Increase this value to find larger DMRs.
                            [default: 2000; x>=1]
  --stepsize INTEGER        Step size of the sliding window in basepairs.
                            Increase this value to gain speed, at the cost of
                            some accuracy.  [default: 1000; x>=1]
  --threshold FLOAT         The t-statistic threshold, e.g. 0.02 means that
                            the top 2% and bottom 2% most differentially
                            methylated genomic bins will be separately merged
                            and reported as DMRs with adjusted p-values.
                            [default: 0.02; 0<=x<=1]
  --min-cells INTEGER       The minimum number of cells required to consider a
                            genomic region for testing. For example, a value
                            of 6 means that only regions with sequencing
                            coverage in at least 6 cells per group are
                            considered.  [default: 6; x>=1]
  --bridge-gaps INTEGER     Merge neighboring DMRs if they are within this
                            distance in basepairs. Useful to prevent
                            fragmented DMRs separated only by small gaps.
                            [default: off]  [x>=0]
  --threads INTEGER         How many CPU threads to use in parallel.
                            [default: all available]
  --write-header            Write the column names of the output file.
                            [default: off]
  --debug                   Use this to also report DMRs that were identified
                            in permutations.  [default: off]
  --help                    Show this message and exit.

matrix

Usage: methscan matrix [OPTIONS] REGIONS DATA_DIR OUTPUT_DIR

  From single cell methylation or NOMe-seq data, calculates the average
  methylation in genomic regions for every cell. The output is a long table
  that can be used e.g. for dimensionality reduction or clustering, analogous
  to a count matrix in scRNA-seq.

  REGIONS is a .bed file of regions for which methylation will be
  quantified in every cell.

  DATA_DIR is the directory containing the methylation matrices
  produced by running 'methscan prepare', as well as the smoothed methylation
  values produced by running 'methscan smooth'.

  OUTPUT_DIR is the output directory.
  It will contain four cell × region matrices ("count tables"):
  'methylated_sites.csv.gz': the number of sites that were methylated
  'total_sites.csv.gz': the total number of observed sites (sites with read coverage)
  'methylation_fractions.csv.gz': the average methylation, calculated as:
      # of methylated sites / # of total observed sites
  'mean_shrunken_residuals.csv.gz': the mean shrunken residuals, a more accurate
      measure of methylation in a genomic region.
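
  The first three tables follow directly from the 1 / -1 / 0 encoding
  produced by 'methscan prepare'. A minimal sketch for one cell and one
  region is shown below; the function region_counts is hypothetical, and the
  shrunken residuals are not covered because their calculation is not
  described here.

```python
def region_counts(values):
    """Summarise one cell's encoded sites (1, -1 or 0) within one region.

    Returns (methylated sites, total observed sites, methylation fraction).
    The fraction is None if the cell has no observed site in the region.
    """
    methylated = sum(1 for v in values if v == 1)
    total = sum(1 for v in values if v != 0)
    fraction = methylated / total if total else None
    return methylated, total, fraction
```

  For example, the values [1, 1, -1, 0, 0] give 2 methylated sites out of 3
  observed sites.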

Options:
  --sparse           [experimental] Write the output as a sparse matrix,
                     instead of the four .csv.gz files described above. This
                     is faster and more space-efficient for huge data sets.
                     The output 'matrix.mtx.gz' contains five columns:
                     row_index, col_index, shrunken residuals, methylation
                     fractions, coverage. Both indices are 1-indexed. Missing
                     entries denote NA, not zero (!).  [default: off]
  --threads INTEGER  How many CPU threads to use in parallel.  [default: all
                     available]
  --help             Show this message and exit.

profile

Usage: methscan profile [OPTIONS] REGIONS DATA_DIR OUTPUT

  From single cell methylation or NOMe-seq data, calculates the average
  methylation profile of a set of genomic regions. Useful for plotting and
  visually comparing methylation between groups of regions or cells.

  REGIONS is an alphabetically sorted (!) .bed file of regions for
  which the methylation profile will be produced.

  DATA_DIR is the directory containing the methylation matrices
  produced by running 'methscan prepare'.

  OUTPUT is the file path where the methylation profile data will be
  written. Should end with '.csv'.

Options:
  --width INTEGER          The total width of the profile plot in bp. The
                           center of all bed regions will be extended in both
                           directions by half of this amount. Shorter regions
                           will be extended, longer regions will be shortened
                           accordingly.  [default: 4000; x>=1]
  --strand-column INTEGER  The bed column number (1-indexed) denoting the DNA
                           strand of the region.  [optional]  [x>=1]
  --label TEXT             Specify a constant value to be added as a column to
                           the output table. This can be useful to give each
                           output a unique label when you want to concatenate
                           multiple outputs.  [optional]
  --help                   Show this message and exit.
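
The resizing rule described for --width can be sketched as follows. The
function resize_region is hypothetical, and the integer rounding of the
center is an assumption; the sketch does not handle regions close to the
start of a chromosome.

```python
def resize_region(start, end, width=4000):
    """Return a region of the given width around the center of (start, end)."""
    center = (start + end) // 2
    half = width // 2
    # Shorter regions are extended, longer regions are shortened.
    return center - half, center + half
```

Both a 500 bp region and a 10000 bp region end up exactly 4000 bp wide with
the default setting.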