Longshot is a variant calling tool for diploid genomes using long error prone reads such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore Technologies (ONT). It takes as input an aligned BAM file and outputs a phased VCF file with variants and haplotype information. It can also output haplotype-separated BAM files that can be used for downstream analysis. Currently, it only calls single nucleotide variants (SNVs).
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load longshot
[user@cn3144 ~]$ cd /data/$USER/LONGSHOT_TEST
[user@cn3144 ~]$ ls
genome.fa genome.fa.fai pacbio_reads_30x.bam pacbio_reads_30x.bam.bai
[user@cn3144 ~]$ longshot --bam pacbio_reads_30x.bam --ref genome.fa --out longshot_output.vcf
2019-11-26 09:58:25 Min read coverage set to 6.
2019-11-26 09:58:25 Max read coverage set to 8000.
2019-11-26 09:58:25 Estimating alignment parameters...
2019-11-26 09:58:26 Done estimating alignment parameters.
Transition Probabilities:
match -> match: 0.871
match -> insertion: 0.099
match -> deletion: 0.030
deletion -> match: 0.964
deletion -> deletion: 0.036
insertion -> match: 0.894
insertion -> insertion: 0.106
Emission Probabilities:
match (equal): 0.978
match (not equal): 0.007
insertion: 1.000
deletion: 1.000
GENOTYPE PRIORS:
REF G1/G2 PROB
C D/I 0
G A/A 0.00016666692910805806
G D/I 0
T T/T 0.9985001541646916
A C/D 0
A A/T 0.00033333385823389856
[...]
2019-11-26 09:58:26 Calling potential SNVs using pileup...
2019-11-26 09:58:27 748 potential SNVs identified.
2019-11-26 09:58:27 Generating haplotype fragments from reads...
2019-11-26 09:58:27 10% of variants processed...
2019-11-26 09:58:28 20% of variants processed...
2019-11-26 09:58:28 30% of variants processed...
2019-11-26 09:58:28 40% of variants processed...
2019-11-26 09:58:28 50% of variants processed...
2019-11-26 09:58:28 60% of variants processed...
2019-11-26 09:58:28 70% of variants processed...
2019-11-26 09:58:28 80% of variants processed...
2019-11-26 09:58:28 90% of variants processed...
2019-11-26 09:58:29 100% of variants processed.
2019-11-26 09:58:29 Calling initial genotypes using pair-HMM realignment...
2019-11-26 09:58:29 Iteratively assembling haplotypes and refining genotypes...
2019-11-26 09:58:29 Round 1 of haplotype assembly...
2019-11-26 09:58:29 (Before HapCUT2) Total phased heterozygous SNVs: 468 Total likelihood (phred): 211782.83
2019-11-26 09:58:31 (After HapCUT2) Total phased heterozygous SNVs: 468 Total likelihood (phred): 39775.84
2019-11-26 09:58:31 (After Greedy) Total phased heterozygous SNVs: 468 Total likelihood (phred): 39507.12
2019-11-26 09:58:31 Round 2 of haplotype assembly...
2019-11-26 09:58:31 (Before HapCUT2) Total phased heterozygous SNVs: 476 Total likelihood (phred): 39507.12
2019-11-26 09:58:32 (After HapCUT2) Total phased heterozygous SNVs: 476 Total likelihood (phred): 39507.12
2019-11-26 09:58:33 (After Greedy) Total phased heterozygous SNVs: 476 Total likelihood (phred): 39507.12
2019-11-26 09:58:33 Printing VCF file...
[user@cn3144 ~]$ ls
genome.fa genome.fa.fai longshot_output.vcf pacbio_reads_30x.bam pacbio_reads_30x.bam.bai
[user@cn3144 ~]$ head longshot_output.vcf
##fileformat=VCFv4.2
##source=Longshot v0.3.5
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
##INFO=<ID=AM,Number=1,Type=Integer,Description="Number of Ambiguous Allele Observations">
##INFO=<ID=MC,Number=1,Type=Integer,Description="Minimum Error Correction (MEC) for this single variant">
##INFO=<ID=MF,Number=1,Type=Float,Description="Minimum Error Correction (MEC) Fraction for this variant.">
##INFO=<ID=MB,Number=1,Type=Float,Description="Minimum Error Correction (MEC) Fraction for this variant's haplotype block.">
##INFO=<ID=AQ,Number=1,Type=Float,Description="Mean Allele Quality value (PHRED-scaled).">
##INFO=<ID=GM,Number=1,Type=Integer,Description="Phased genotype matches unphased genotype (boolean).">
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Create a batch input file (e.g. longshot.sh). For example:
#!/bin/bash
set -e
module load longshot
longshot --bam /data/$USER/LONGSHOT_TEST/pacbio_reads_30x.bam \
--ref /data/$USER/LONGSHOT_TEST/genome.fa \
--out /data/$USER/LONGSHOT_TEST/longshot_output.vcf
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] longshot.sh
Create a swarmfile (e.g. longshot.swarm). For example:
longshot --bam /data/$USER/LONGSHOT_TEST/pacbio_reads_1.bam \
--ref /data/$USER/LONGSHOT_TEST/genome.fa \
--out /data/$USER/LONGSHOT_TEST/output_1.vcf
longshot -A -r chr1 \
--bam /data/$USER/LONGSHOT_TEST/pacbio_reads_2.bam \
--ref /data/$USER/LONGSHOT_TEST/genome.fa \
--out /data/$USER/LONGSHOT_TEST/output_2.vcf
Submit this job using the swarm command.
swarm -f longshot.swarm [-g #] [-t #] --module longshotwhere
| -g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
| --module longshot | Loads the longshot module for each subjob in the swarm |