PEPATAC is a robust pipeline for Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) built on a loosely coupled modular framework. It may be easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. It is optimized on unique features of ATAC-seq data to be fast and accurate and provides several unique analytical approaches.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=8g --gres=lscratch:20 --time=36:00:00 [user@cn3200 ~]$ module load PEPATAC/0.12.2 [+] Loading pepatac 0.12.2 [+] Loading singularity 3.8.5-1 on cn3089 [user@cn3200 ~]$ pepatac -h usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-T] [--silent] [--verbosity V] [--logdev] [-C CONFIG_FILE] [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT] [-P NUMBER_OF_CORES] [-S SAMPLE_NAME] -I INPUT_FILES [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [--trimmer {trimmomatic,pyadapt,skewer}] [--aligner {bowtie2,bwa}] [--deduplicator {picard,samblaster,samtools}] [--peak-caller {fseq,fseq2,genrich,hmmratac,homer,macs2}] [-gs GENOME_SIZE] [--peak-type {fixed,variable}] [--extend EXTEND] [--frip-ref-peaks FRIP_REF_PEAKS] [--motif] [--sob] [--no-scale] [--prioritize] [--keep] [--noFIFO] [--lite] [--skipqc] [--prealignment-names PREALIGNMENT_NAMES [PREALIGNMENT_NAMES ...]] [--prealignment-index PREALIGNMENT_INDEX [PREALIGNMENT_INDEX ...]] --genome-index GENOME_INDEX --chrom-sizes CHROM_SIZES [--TSS-name TSS_NAME] [--blacklist BLACKLIST] [--anno-name ANNO_NAME] [--search-file SEARCH_FILE] [-V] PEPATAC version 0.12.2 optional arguments: -h, --help show this help message and exit -R, --recover Overwrite locks to recover from previous failed run -N, --new-start Overwrite all results to start a fresh run -D, --dirty Don't auto-delete intermediate files -F, --force-follow Always run 'follow' commands -T, --testmode Only print commands, don't run --silent Silence logging. Overrides verbosity. --verbosity V Set logging level (1-5 or logging module level name) --logdev Expand content of logging message format. -C CONFIG_FILE, --config CONFIG_FILE Pipeline configuration file (YAML). Relative paths are with respect to the pipeline script. -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER Parent output directory of project -M MEMORY_LIMIT, --mem MEMORY_LIMIT Memory limit for processes accepting such. Default units are megabytes unless specified using the suffix [K|M|G|T]. -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES Number of cores for parallelized processes -S SAMPLE_NAME, --sample-name SAMPLE_NAME Name for sample to run -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]] Secondary input files, such as read2 -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED Single- or paired-end sequencing protocol --trimmer {trimmomatic,pyadapt,skewer} Name of read trimming program. --aligner {bowtie2,bwa} Name of read aligner. --deduplicator {picard,samblaster,samtools} Name of deduplicator program. --peak-caller {fseq,fseq2,genrich,hmmratac,homer,macs2} Name of peak caller. -gs GENOME_SIZE, --genome-size GENOME_SIZE Effective genome size. It can be 1.0e+9 or 1000000000: e.g. human (2.7e9), mouse (1.87e9), C. elegans (9e7), fruitfly (1.2e8). Default:2.7e9 --peak-type {fixed,variable} Call variable or fixed width peaks. Fixed width requires MACS2. --extend EXTEND How far to extend fixed width peaks up and downstream. --frip-ref-peaks FRIP_REF_PEAKS Path to reference peak set (BED format) for calculating FRiP. --motif Perform motif enrichment analysis. --sob Use seqOutBias to produce signal tracks, incorporate mappability information, and account for Tn5 bias. --no-scale Do not scale signal tracks: Default is to scale by read count. If using seqOutBias, scales by the expected/observed cut frequency. --prioritize Plot cFRiF/FRiF using mutually exclusive priority ranked features based on the order of feature appearance in the feature annotation asset. --keep Enable this flag to keep prealignment BAM files. --noFIFO Do NOT use named pipes during prealignments. --lite Only keep minimal, essential output to conserve disk space. --skipqc Skip FastQC. Useful for bugs in FastQC that appear with some sequence read files. --prealignment-names PREALIGNMENT_NAMES [PREALIGNMENT_NAMES ...] Space-delimited list of prealignment genome names to align to before primary alignment. --prealignment-index PREALIGNMENT_INDEX [PREALIGNMENT_INDEX ...] Space-delimited list of prealignment genome name and index files delimited by an equals sign to align to before primary alignment. e.g. rCRSd=/path/to/bowtie2_index/. --genome-index GENOME_INDEX Path to primary genome index file. Either a bowtie2 or bwa index. --chrom-sizes CHROM_SIZES Path to primary genome chromosome sizes file. --TSS-name TSS_NAME Path to TSS annotation file. --blacklist BLACKLIST Path to genomic region blacklist file. --anno-name ANNO_NAME Path to reference annotation file (BED format) for calculating FRiF. --search-file SEARCH_FILE Required for seqOutBias (--sob). Path to tallymer index search file built with the same read length as the input. -V, --version show program's version number and exit required named arguments: -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...] One or more primary input files -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY Identifier for genome assemblyDescribed below are the steps to run PEPATAC on the tutorial example.
[user@cn3200 ~]$ mkdir pepatac_tutorial [user@cn3200 ~]$ export TUTORIAL=$PWD/pepatac_tutorial [user@cn3200 ~]$ cd $TUTORIAL [user@cn3200 ~]$ mkdir data genomes processed templates tools [user@cn3200 ~]$ cd $TUTORIAL/tools [user@cn3200 ~]$ git clone https://github.com/databio/pepatac.git [user@cn3200 ~]$ cd $TUTORIAL/tools/pepatac [user@cn3200 ~]$ git checkout tags/v0.12.2 Previous HEAD position was 7616783... Merge pull request #199 from databio/dev HEAD is now at 1348557... Merge pull request #181 from databio/dev2. Initialize refgenie and download assets:
[user@cn3200 ~]$ cd $TUTORIAL/genomes [user@cn3200 ~]$ export REFGENIE=$TUTORIAL/genomes/genome_config.yaml [user@cn3200 ~]$ refgenie init -c $REFGENIE [user@cn3200 ~]$ refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb [user@cn3200 ~]$ refgenie build hg38/feat_annotation [user@cn3200 ~]$ refgenie pull rCRSd/fasta [user@cn3200 ~]$ refgenie pull rCRSd/bowtie2_indexAdd the export REFGENIE line to your .bashrc or .profile to ensure it persists.
[user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial1_r1.fastq.gz [user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial1_r2.fastq.gz [user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial2_r1.fastq.gz [user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial2_r2.fastq.gz [user@cn3200 ~]$ mv *.fastq.gz $TUTORIAL/tools/pepatac/examples/data/4. Configure project files:
[user@cn3200 ~]$ cd $TUTORIAL [user@cn3200 ~]$ export DIVCFG=$TUTORIAL/compute_config.yaml [user@cn3200 ~]$ divvy init --config $DIVCFGAdd the export DIVCFG line to your ~/.bashrc to ensure it persists.
[user@cn3200 ~]$ cd $TUTORIAL/templates [user@cn3200 ~]$ cp $PEPATAC_CONFIG/localhost_template.sub .The PEPATAC pipeline is divided into two major parts:
[user@cn3200 ~]$ cd $TUTORIAL/processed [user@cn3200 ~]$ looper run -c $PEPATAC_EXAMPLES/tutorial/.looper_tutorial_refgenie.yaml Looper version: 2.0.0 Command: run Pipeline name mismatch detected. Pipeline interface: PEPATAC Output schema: None Defaulting to pipeline_interface value. ## [1 of 2] sample: tutorial1; pipeline: PEPATAC Calling pre-submit function: refgenconf.looper_refgenie_populate Refgenie asset (hg38/bowtie2_index) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/ensembl_gtf) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/ensembl_rb) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/fasta) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/feat_annotation) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/refgene_anno) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (rCRSd/bowtie2_index) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (rCRSd/fasta) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Writing script to /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_tutorial1.sub Job script (n=1; 0.03Gb): /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_tutorial1.sub Compute node: cn0811 Start time: 2025-03-06 12:40:04 No pipestat output schema was supplied to PipestatManager. Initializing results file '/data/denisovga/PEPATAC/pepatac_tutorial/processed/results_pipeline/tutorial1/stats.yaml' ... ... Looper finished Samples valid for job generation: 2 of 2The previous command produces two sbatch scripts.
[user@cn3200 ~]$ sbatch $TUTORIAL/processed/submission/PEPATAC_tutorial1.sub [user@cn3200 ~]$ sbatch $TUTORIAL/processed/submission/PEPATAC_tutorial2.sub6. When the two submitted jobs are completed, run the PEPATAC pipeline at the project level using looper:
[user@cn3200 ~]$ looper runp -c $PEPATAC_EXAMPLES/tutorial/.looper_tutorial_refgenie.yaml Looper version: 2.0.0 Command: runp ## [1 of 1] project: PEPATAC_tutorial; pipeline: PEPATAC Writing script to /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_PEPATAC_tutorial.sub Job script (n=1; 0.00Gb): /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_PEPATAC_tutorial.sub Compute node: cn0811 Start time: 2025-03-09 10:28:44 No pipestat output schema was supplied to PipestatManager. ### Pipeline run code and environment: * Command: `/data/nammid2/pepatac_tutorial/tools/pepatac/pipelines/pepatac_collator.py --config /usr/local/apps/PEPATAC/0.12.2/pepatac-0.12.2/examples/tutorial/tutorial_refgenie_pep_config.yaml -O /data/nammid2/pepatac_tutorial/processed/results_pipeline -P 1 -M 16000 -n PEPATAC_tutorial -r /data/nammid2/pepatac_tutorial/processed/results_pipeline` * Compute host: `cn0006` * Working dir: `/vf/users/nammid2/pepatac_tutorial/processed` * Outfolder: `/data/nammid2/pepatac_tutorial/processed/results_pipeline/summary/` * Log file: `/data/nammid2/pepatac_tutorial/processed/results_pipeline/summary/PEPATAC_log.md` * Start time: (03-09 10:28:45) elapsed: 0.0 _TIME_ ### Version log: * Python version: `3.10.14` * Pypiper dir: `/opt/conda/envs/pepatac/lib/python3.10/site-packages/pypiper` * Pypiper version: `0.14.4` * Pipeline dir: `/vf/users/nammid2/pepatac_tutorial/tools/pepatac/pipelines` * Pipeline version: `0.0.7` * Pipeline hash: `9d4912c9e7d46b733e9475ee9bfde308cf68092f` * Pipeline branch: `* (HEAD detached at v0.12.2)` * Pipeline date: `2025-02-13 07:46:14 -0500` * Pipeline diff: `2 files changed, 0 insertions(+), 0 deletions(-)` ... ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ...End the interactive session:
[user@cn3200 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$