Clair3 is a small variant caller for Illumina, PacBio and ONT long reads. Compare to PEPPER (r0.4), Clair3 (v0.1) shows a better SNP F1-score with ≤30-fold of ONT data (precisionFDA Truth Challenge V2), and a better Indel F1-score, while runs generally four times faster.
Allocate an interactive session and run the program.
Sample session on a GPU node:
[user@biowulf ~]$ sinteractive --gres=gpu:p100:1,lscratch:10 --mem=16g -c4 [user@cn2379 ~]$ module load clair3 [+] Loading singularity 4.0.1 on cn0795 [+] Loading cuDNN/8.1.0.77/CUDA-11.2.2 libraries... [+] Loading CUDA Toolkit 11.2.2 ... [+] Loading clair3 1.0.4 ... [user@cn2379 ~]$ cp -r $CLAIR3_DATA/* . [user@cn2379 ~]$ THREADS=4 [user@cn2379 ~]$ OUTPUT_VCF_FILE_PATH=merge_output.vcf.gzProcessing sample Illumina data
[user@cn2379 ~]$ PLATFORM='ilmn'
[user@cn2379 ~]$ INPUT_DIR="Illumina"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
--bam_fn=${INPUT_DIR}/${BAM} \
--ref_fn=${INPUT_DIR}/${REF} \
--model_path=${CLAIR3_MODELS}/${PLATFORM} \
--threads=${THREADS} \
--platform=${PLATFORM} \
--output=./ \
--bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH}
...
Processing sample PacBio Hifi data
[user@cn2379 ~]$ PLATFORM='hifi'
[user@cn2379 ~]$ INPUT_DIR="PacBio"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
[user@cn2379 ~]$ clair3 \
--bam_fn=${INPUT_DIR}/${BAM} \
--ref_fn=${INPUT_DIR}/${REF} \
--threads=${THREADS} \
--platform=${PLATFORM} \
--model_path=${CLAIR3_MODELS}/${PLATFORM} \
--output=./ \
--bed_fn=${INPUT_DIR}/${BASELINE_BED_FILE_PATH}
...
Processing sample ONT data using a custom model
[user@cn2379 ~]$ PLATFORM='ont'
[user@cn2379 ~]$ INPUT_DIR="ONT"
[user@cn2379 ~]$ cp -r $CLAIR3_DATA/${INPUT_DIR} .
[user@cn2379 ~]$ REF="GRCh38_no_alt_chr20.fa"
[user@cn2379 ~]$ BAM="HG003_chr20_demo.bam"
[user@cn2379 ~]$ BASELINE_VCF_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz"
[user@cn2379 ~]$ BASELINE_BED_FILE_PATH="HG003_GRCh38_chr20_v4.2.1_benchmark_noinconsistent.bed"
Download a custom model:
[user@cn2379 ~]$ wget http://www.bio8.cs.hku.hk/clair3/clair3_models/r1041_e82_400bps_sup_v420.tar.gz [user@cn2379 ~]$ tar -zxf r1041_e82_400bps_sup_v420.tar.gz && rm -f r1041_e82_400bps_sup_v420.tar.gzThe last command creates a model folder r1041_e82_400bps_sup_v420.
[user@cn2379 ~]$ clair3 \
--bam_fn=${INPUT_DIR}/${BAM} \
--ref_fn=${INPUT_DIR}/${REF} \
--threads=${THREADS} \
--platform=${PLATFORM} \
--model_path=${PWD}/r1041_e82_400bps_sup_v420 \
--output=./ \
--vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH}
...
[INFO] CLAIR3 VERSION: v1.0.4
[INFO] BAM FILE PATH: /vf/users/denisovga/Clair3/ONT/HG003_chr20_demo.bam
[INFO] REFERENCE FILE PATH: /vf/users/denisovga/Clair3/ONT/GRCh38_no_alt_chr20.fa
[INFO] MODEL PATH: /data/denisovga/Clair3/r1041_e82_400bps_sup_v420
[INFO] OUTPUT FOLDER: /vf/users/denisovga/Clair3/.
[INFO] PLATFORM: ont
[INFO] THREADS: 4
[INFO] BED FILE PATH: EMPTY
[INFO] VCF FILE PATH: /vf/users/denisovga/Clair3/ONT/HG003_GRCh38_chr20_v4.2.1_benchmark.vcf.gz
[INFO] CONTIGS: EMPTY
[INFO] CONDA PREFIX:
[INFO] SAMTOOLS PATH: samtools
[INFO] PYTHON PATH: python3
[INFO] PYPY PATH: pypy3
[INFO] PARALLEL PATH: parallel
[INFO] WHATSHAP PATH: whatshap
[INFO] LONGPHASE PATH: EMPTY
[INFO] CHUNK SIZE: 5000000
[INFO] FULL ALIGN PROPORTION: 0.7
[INFO] FULL ALIGN REFERENCE PROPORTION: 0.1
[INFO] PHASING PROPORTION: 0.7
...
[INFO] 1/7 Call variants using pileup model
Calling variants ...
...
[INFO] 2/7 Select heterozygous SNP variants for Whatshap phasing and haplotagging
[INFO] Select heterozygous pileup variants exceeding phasing quality cutoff 14
[INFO] Total heterozygous SNP positions selected: chr20: 307
real 0m0.345s
user 0m0.263s
sys 0m0.069s
[INFO] 3/7 Phase VCF file using Whatshap
This is WhatsHap 1.7 running under Python 3.9.0
Working on 1 sample from 1 family
# Working on contig chr20 in individual SAMPLE
Found 307 usable heterozygous variants (0 skipped due to missing genotypes)
...
[INFO] 5/7 Select candidates for full-alignment calling
[INFO] Set variants quality cutoff 19.0
[INFO] Set reference calls quality cutoff 5.0
[INFO] Low quality reference calls to be processed in chr20: 6
[INFO] Low quality variants to be processed in chr20: 449
real 0m0.340s
user 0m0.250s
sys 0m0.083s
[INFO] 6/7 Call low-quality variants using full-alignment model
Calling variants ...
Total processed positions in chr20 (chunk 1/1) : 455
Total time elapsed: 5.21 s
real 0m8.084s
user 0m7.120s
sys 0m0.750s
[INFO] 7/7 Merge pileup VCF and full-alignment VCF
[INFO] Pileup variants processed in chr20: 195
[INFO] Full-alignment variants processed in chr20: 445
...
[INFO] Finish calling, output file: /vf/users/denisovga/Clair3/./merge_output.vcf.gz
real 0m59.203s
user 1m55.518s
sys 0m13.950s
Create a batch input file (e.g. clair3.sh). For example:
#!/bin/bash
set -e
module load Clair3
...
clair3 \
--bam_fn=${INPUT_DIR}/${BAM} \
--ref_fn=${INPUT_DIR}/${REF} \ --threads=${THREADS} \
--platform=${PLATFORM} \ --model_path=${CLAIR3_MODELS}/${PLATFORM} \
--output=./ \
--vcf_fn=${INPUT_DIR}/${BASELINE_VCF_FILE_PATH}
Submit this job using the Slurm sbatch command.
sbatch clair3.sh