Panaroo: An updated pipeline for pangenome investigation
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load panaroo
[+] Loading panaroo 1.4.2 on cn3144
[+] Loading singularity 4.0.1 on cn3144
[user@cn3144 ~]$ panaroo -h
usage: panaroo [-h] -i INPUT_FILES [INPUT_FILES ...] -o OUTPUT_DIR --clean-mode
{strict,moderate,sensitive} [--remove-invalid-genes] [-c ID]
[-f FAMILY_THRESHOLD] [--len_dif_percent LEN_DIF_PERCENT]
[--merge_paralogs] [--search_radius SEARCH_RADIUS]
[--refind_prop_match REFIND_PROP_MATCH] [--refind_strict]
[--min_trailing_support MIN_TRAILING_SUPPORT]
[--trailing_recursive TRAILING_RECURSIVE]
[--edge_support_threshold EDGE_SUPPORT_THRESHOLD]
[--length_outlier_support_proportion LENGTH_OUTLIER_SUPPORT_PROPORTION]
[--remove_by_consensus {True,False}] [--high_var_flag CYCLE_THRESHOLD_MIN]
[--min_edge_support_sv MIN_EDGE_SUPPORT_SV] [--all_seq_in_graph]
[--no_clean_edges] [-a {core,pan}] [--aligner {prank,clustal,mafft}]
[--codons] [--core_threshold CORE] [--core_subset SUBSET]
[--core_entropy_filter HC_THRESHOLD] [-t N_CPU] [--codon-table TABLE]
[--quiet] [--version]
panaroo: an updated pipeline for pangenome investigation
options:
-h, --help show this help message and exit
-t N_CPU, --threads N_CPU
number of threads to use (default=1)
--codon-table TABLE the codon table to use for translation (default=11)
--quiet suppress additional output
--version show program's version number and exit
Input/output:
-i INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
input GFF3 files (usually output from running Prokka). Can also
take a file listing each gff file line by line.
-o OUTPUT_DIR, --out_dir OUTPUT_DIR
location of an output directory
Mode:
--clean-mode {strict,moderate,sensitive}
The stringency mode at which to run panaroo. Must be
one of 'strict','moderate' or 'sensitive'. Each of
these modes can be fine tuned using the additional
parameters in the 'Graph correction' section.
strict:
Requires fairly strong evidence (present in at least
5% of genomes) to keep likely contaminant genes. Will
remove genes that are refound more often than they were
called originally.
moderate:
Requires moderate evidence (present in at least 1% of
genomes) to keep likely contaminant genes. Keeps genes
that are refound more often than they were called
originally.
sensitive:
Does not delete any genes and only performes merge and
refinding operations. Useful if rare plasmids are of
interest as these are often hard to disguish from
contamination. Results will likely include higher
number of spurious annotations.
--remove-invalid-genes
removes annotations that do not conform to the expected Prokka
format such as those including premature stop codons.
Matching:
-c ID, --threshold ID
sequence identity threshold (default=0.98)
-f FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD
protein family sequence identity threshold (default=0.7)
--len_dif_percent LEN_DIF_PERCENT
length difference cutoff (default=0.98)
--merge_paralogs don't split paralogs
Refind:
--search_radius SEARCH_RADIUS
the distance in nucleotides surronding the neighbour of an
accessory gene in which to search for it
--refind_prop_match REFIND_PROP_MATCH
the proportion of an accessory gene that must be found in order
to consider it a match
--refind_strict Prevent fragmented, misassembled, or potential pseudogene
sequences from being re-found.
Graph correction:
--min_trailing_support MIN_TRAILING_SUPPORT
minimum cluster size to keep a gene called at the end of a contig
--trailing_recursive TRAILING_RECURSIVE
number of times to perform recursive trimming of low support
nodes near the end of contigs
--edge_support_threshold EDGE_SUPPORT_THRESHOLD
minimum support required to keep an edge that has been flagged as
a possible mis-assembly
--length_outlier_support_proportion LENGTH_OUTLIER_SUPPORT_PROPORTION
proportion of genomes supporting a gene with a length more than
1.5x outside the interquatile range for genes in the same cluster
(default=0.01). Genes failing this test will be re-annotated at
the shorter length
--remove_by_consensus {True,False}
if a gene is called in the same region with similar sequence a
minority of the time, remove it. One of 'True' or 'False'
--high_var_flag CYCLE_THRESHOLD_MIN
minimum number of nested cycles to call a highly variable gene
region (default = 5).
--min_edge_support_sv MIN_EDGE_SUPPORT_SV
minimum edge support required to call structural variants in the
presence/absence sv file
--all_seq_in_graph Retains all DNA sequence for each gene cluster in the graph
output. Off by default as it uses a large amount of space.
--no_clean_edges Turn off edge filtering in the final output graph.
Gene alignment:
-a {core,pan}, --alignment {core,pan}
Output alignments of core genes or all genes. Options are 'core'
and 'pan'. Default: 'None'
--aligner {prank,clustal,mafft}
Specify an aligner. Options:'prank', 'clustal', and default:
'mafft'
--codons Generate codon alignments by aligning sequences at the protein
level
--core_threshold CORE
Core-genome sample threshold (default=0.95)
--core_subset SUBSET Randomly subset the core genome to these many genes (default=all)
--core_entropy_filter HC_THRESHOLD
Manually set the Block Mapping and Gathering with Entropy (BMGE)
filter. Can be between 0.0 and 1.0. By default this is set using
the Tukey outlier method.
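The -i help text above notes that panaroo can also take a file listing each GFF file line by line. A minimal sketch of building such a list follows. The directory name gff, the list name gff_list.txt, and the empty placeholder sample files are assumptions made for illustration only, so that the snippet runs on its own; real runs would point at actual annotation output.

```shell
#!/bin/bash
set -e

# Work in a scratch directory so the sketch does not touch real data.
cd "$(mktemp -d)"

# Hypothetical layout: one GFF3 file per genome in ./gff.
gff_dir="gff"
list_file="gff_list.txt"

# Empty placeholder files stand in for real annotations (e.g. Prokka output).
mkdir -p "$gff_dir"
touch "$gff_dir/sampleA.gff" "$gff_dir/sampleB.gff" "$gff_dir/sampleC.gff"

# Write one path per line, as described in the -i help text.
find "$gff_dir" -maxdepth 1 -name '*.gff' | sort > "$list_file"

n_files=$(wc -l < "$list_file" | tr -d ' ')
echo "Listed $n_files GFF files in $list_file"

# The list would then replace the individual file names, for example:
#   panaroo -i gff_list.txt -o results --clean-mode strict
```

Passing a list file keeps the command line short when there are many genomes, where a shell glob could otherwise expand to a very long argument list.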
Create a batch input file (e.g. panaroo.sh). For example:
#!/bin/bash
set -e
module load panaroo
panaroo -i input.gff -o results --clean-mode sensitive
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] panaroo.sh
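According to the help output, panaroo uses a single thread unless -t is given (default=1), so CPUs requested with --cpus-per-task are only used if the thread count is passed on to the program. The following is a sketch of a batch script that does this, assuming Slurm has set SLURM_CPUS_PER_TASK for the job. The guards around module and the dry-run fallback are illustrative additions so the script can be tried outside the cluster; they are not part of the original example.

```shell
#!/bin/bash
set -e

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back to 1.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Load the module only where the module command exists (i.e. on the cluster).
if command -v module >/dev/null 2>&1; then
    module load panaroo
fi

# Same command as the batch example, with the thread count added.
# Stored in the positional parameters so it can be run or printed unchanged.
set -- panaroo -i input.gff -o results --clean-mode sensitive -t "$threads"

if command -v panaroo >/dev/null 2>&1; then
    "$@"
else
    # Dry run when panaroo is not on PATH: print the command instead.
    echo "$*"
fi
```

Submitted with, for example, sbatch --cpus-per-task=8 panaroo.sh, the script would pass -t 8 to panaroo.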
Create a swarmfile (e.g. panaroo.swarm). For example:
panaroo -i *.gff -o results --clean-mode strict
panaroo -i *.gff -o results --clean-mode strict
panaroo -i *.gff -o results --clean-mode strict
panaroo -i *.gff -o results --clean-mode strict
Submit this job using the swarm command.
swarm -f panaroo.swarm [-g #] [-t #] --module panaroo
where
| -g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
| --module panaroo | Loads the panaroo module for each subjob in the swarm |
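Each line of a swarmfile runs as a separate subjob, so in practice each line would normally point at its own input files and output directory. One way to generate such a file is sketched below. The datasets directory, the cohort names, the results_ prefix, and the empty placeholder GFF files are all hypothetical and exist only so the snippet runs on its own.

```shell
#!/bin/bash
set -e

# Work in a scratch directory so the sketch does not touch real data.
cd "$(mktemp -d)"

# Hypothetical layout: one subdirectory of GFF3 files per dataset.
for name in cohortA cohortB cohortC; do
    mkdir -p "datasets/$name"
    touch "datasets/$name/genome1.gff" "datasets/$name/genome2.gff"
done

swarmfile="panaroo.swarm"
: > "$swarmfile"   # start with an empty swarmfile

for dir in datasets/*/; do
    name=$(basename "$dir")
    # The *.gff glob is written literally so it expands when the subjob runs.
    printf 'panaroo -i %s*.gff -o results_%s --clean-mode strict\n' \
        "$dir" "$name" >> "$swarmfile"
done

cat "$swarmfile"
```

The resulting file has one panaroo command per dataset and can be submitted with the swarm command shown above.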