VADR is a suite of tools for classifying and analyzing sequences homologous to a set of reference models of viral genomes or gene families. It has been mainly tested for analysis of Norovirus, Dengue, and SARS-CoV-2 virus sequences in preparation for submission to the GenBank database.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive [user@cn4471 ~]$ module load vadr [+] Loading hmmer 3.3.2 on cn0847 [+] Loading perl 5.24.3 on cn0847 [+] Loading vadr / 1.1.3 ...Copy sample data to your current folder:
[user@cn4471 ~]$ cp -r $VADR_DATA/* .Preprocess the data with samtools:
[user@cn4471 ~]$ v-build.pl -h ... # v-build.pl :: build homology model of a single sequence for feature annotation # VADR 1.1.3 (Feb 2021) # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - # date: Mon Mar 29 12:40:22 2021 # Usage: v-build.pl [-options] <accession> <path to output directory to create> basic options: -f : force; if dir <output directory> exists, overwrite it -v : be verbose; output commands to stdout as they're run --stk <s> : read single sequence stockholm 'alignment' from <s> --infa <s> : read single sequence fasta file from <s>, don't fetch it --inft <s> : read feature table file from <s>, don't fetch it --ftfetch1 : fetch feature table with efetch -format ft --ftfetch2 : fetch feature table with efetch -format gbc | xml2tbl --gb : parse a genbank file, not a feature table file --ingb <s> : read genbank file from <s>, don't fetch it --addminfo <s> : add feature info from model info file <s> --forcelong : allow long models > 25Kb in length --keep : do not remove intermediate files, keep them all on disk options for controlling what feature types are stored in model info file [default set is: CDS,gene,mat_peptide]: --fall : store info for all feature types (except those in --fskip) --fadd <s> : also store feature types in comma separated string <s> --fskip <s> : do not store info for feature types in comma separated string <s> options for controlling what qualifiers are stored in model info file [default set is:product,gene,exception]: --qall : store info for all qualifiers (except those in --qskip) --qadd <s> : also store info for qualifiers in comma separated string <s> --qftradd <s> : --qadd <s2> only applies for feature types in comma separated string <s> --qskip <s> : do not store info for qualifiers in comma separated string <s> --noaddgene : do not add gene qualifiers from gene features to overlapping features options for including additional model attributes: --group <s> : specify model group is <s> --subgroup <s> : specify model subgroup is <s> options for controlling CDS translation step: --ttbl <n> : use NCBI translation table <n> to translate CDS [1] options for controlling cmbuild step: --cmn <n> : set number of seqs for glocal fwd HMM calibration to <n> --cmp7ml : set CM's filter p7 HMM as the ML p7 HMM --cmere <x> : set CM relative entropy target to <x> --cmeset <x> : set CM eff seq # for CM to <x> --cmemaxseq <x> : set CM maximum alowed eff seq # for CM to <x> --cminfile <s> : read cmbuild options from file <s> options for skipping stages: --skipbuild : skip the cmbuild and blastn db creation steps --onlyurl : output genbank file url for accession and exit optional output files: --ftrinfo : create file with internal feature information --sgminfo : create file with internal segment information other expert options: --execname <s> : define executable name of this script as <s>
[user@cn4471 ~]$ v-build.pl NC_039897 NC_039897 # v-build.pl :: build homology model of a single sequence for feature annotation # VADR 1.1.3 (Feb 2021) # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - # date: Tue Mar 30 14:51:04 2021 # $VADRBLASTDIR: /usr/local/apps/VADR/1.1.3/vadr-vadr-1.1.3/ncbi-blast/bin # $VADREASELDIR: /usr/local/apps/VADR/1.1.3/vadr-vadr-1.1.3/Bio-Easel/src/easel/miniapps # $VADRINFERNALDIR: /usr/local/apps/VADR/1.1.3/vadr-vadr-1.1.3/infernal/binaries # $VADRSCRIPTSDIR: /usr/local/apps/VADR/1.1.3/vadr-vadr-1.1.3/vadr # # accession/model name: NC_039897 # output directory: NC_039897 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - # Fetching FASTA file ... done. [ 3.3 seconds] # Parsing FASTA file ... done. [ 0.0 seconds] # Fetching feature table file ... done. [ 3.9 seconds] # Parsing feature table file ... done. [ 0.0 seconds] # Fetching and parsing protein feature table file(s) ... done. [ 11.8 seconds] # Pruning data read from GenBank ... done. [ 0.0 seconds] # Reformatting FASTA file to Stockholm file ... done. [ 0.0 seconds] # Finalizing feature information ... done. [ 0.0 seconds] # Translating CDS ... done. [ 0.0 seconds] # Building BLAST protein database ... done. [ 0.2 seconds] # Building HMMER protein database ... done. [ 2.6 seconds] # Building CM (should take roughly 10-30 minutes) ... done. [ 749.1 seconds] # Pressing CM file ... done. [ 0.2 seconds] # Building BLAST nucleotide database of CM consensus ... done. [ 0.4 seconds] # Creating model info file ... done. [ 0.0 seconds] # # Output printed to screen saved in: NC_039897.vadr.log # List of executed commands saved in: NC_039897.vadr.cmd # List and description of all output files saved in: NC_039897.vadr.filelist # fasta file for NC_039897 saved in: NC_039897.vadr.fa # feature table format file for NC_039897 saved in: NC_039897.vadr.tbl # feature table format file for YP_009538340.1 saved in: NC_039897.vadr.YP_009538340.1.tbl # feature table format file for YP_009538341.1 saved in: NC_039897.vadr.YP_009538341.1.tbl # feature table format file for YP_009538342.1 saved in: NC_039897.vadr.YP_009538342.1.tbl # Stockholm alignment file for NC_039897 saved in: NC_039897.vadr.stk # fasta sequence file for CDS from NC_039897 saved in: NC_039897.vadr.cds.fa # fasta sequence file for translated CDS from NC_039897 saved in: NC_039897.vadr.protein.fa # BLAST db .phr file for NC_039897 saved in: NC_039897.vadr.protein.fa.phr # BLAST db .pin file for NC_039897 saved in: NC_039897.vadr.protein.fa.pin # BLAST db .psq file for NC_039897 saved in: NC_039897.vadr.protein.fa.psq # BLAST db .pdb file for NC_039897 saved in: NC_039897.vadr.protein.fa.pdb # BLAST db .pot file for NC_039897 saved in: NC_039897.vadr.protein.fa.pot # BLAST db .ptf file for NC_039897 saved in: NC_039897.vadr.protein.fa.ptf # BLAST db .pto file for NC_039897 saved in: NC_039897.vadr.protein.fa.pto # HMMER model db file for NC_039897 saved in: NC_039897.vadr.protein.hmm # hmmbuild build output (concatenated) saved in: NC_039897.vadr.protein.hmmbuild # binary HMM and p7 HMM filter file saved in: NC_039897.vadr.protein.hmm.h3m # SSI index for binary HMM file saved in: NC_039897.vadr.protein.hmm.h3i # optimized p7 HMM filters (MSV part) saved in: NC_039897.vadr.protein.hmm.h3f # optimized p7 HMM filters (remainder) saved in: NC_039897.vadr.protein.hmm.h3p # hmmpress output file saved in: NC_039897.vadr.hmmpress # CM file saved in: NC_039897.vadr.cm # cmbuild output file saved in: NC_039897.vadr.cmbuild # binary CM and p7 HMM filter file saved in: NC_039897.vadr.cm.i1m # SSI index for binary CM file saved in: NC_039897.vadr.cm.i1i # optimized p7 HMM filters (MSV part) saved in: NC_039897.vadr.cm.i1f # optimized p7 HMM filters (remainder) saved in: NC_039897.vadr.cm.i1p # cmpress output file saved in: NC_039897.vadr.cmpress # fasta sequence file with cmemit consensus sequence for NC_039897 saved in: NC_039897.vadr.nt.fa # BLAST db .nhr file for NC_039897 saved in: NC_039897.vadr.nt.fa.nhr # BLAST db .nin file for NC_039897 saved in: NC_039897.vadr.nt.fa.nin # BLAST db .nsq file for NC_039897 saved in: NC_039897.vadr.nt.fa.nsq # BLAST db .ndb file for NC_039897 saved in: NC_039897.vadr.nt.fa.ndb # BLAST db .not file for NC_039897 saved in: NC_039897.vadr.nt.fa.not # BLAST db .ntf file for NC_039897 saved in: NC_039897.vadr.nt.fa.ntf # BLAST db .nto file for NC_039897 saved in: NC_039897.vadr.nt.fa.nto # VADR 'model info' format file for NC_039897 saved in: NC_039897.vadr.minfo # # All output files created in directory ./NC_039897/ # # Elapsed time: 00:12:51.55 # hh:mm:ss # [ok]
[user@cn4471 ~]$ v-annotate.pl -h
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.1.3 (Feb 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Tue Mar 30 15:17:51 2021
#
Usage: v-annotate.pl [-options] <fasta file to annotate> <output directory to create>
basic options:
-f : force; if output dir exists, overwrite it
-v : be verbose; output commands to stdout as they're run
--atgonly : only consider ATG a valid start codon
--minpvlen <n> : min CDS/mat_peptide/gene length for feature table output and protein validation is <n> [30]
--keep : do not remove intermediate files, keep them all on disk
options for specifying classification:
--group <s> : set expected classification of all seqs to group <s>
--subgroup <s> : set expected classification of all seqs to subgroup <s>
options for controlling severity of alerts:
--alt_list : output summary of all alerts and exit
--alt_pass <s> : specify that alert codes in comma-separated <s> do not cause FAILure
--alt_fail <s> : specify that alert codes in comma-separated <s> do cause FAILure
--alt_mnf_yes <s> : alert codes in <s> for 'misc_not_failure' features cause misc_feature-ization, not failure
--alt_mnf_no <s> : alert codes in <s> for 'misc_not_failure' features cause failure, not misc-feature-ization
--ignore_mnf : ignore non-zero 'misc_not_feature' values in .minfo file, set to 0 for all features/models
options related to model files:
-m <s> : use CM file <s> instead of default
-a <s> : use HMM file <s> instead of default
-i <s> : use model info file <s> instead of default
-n <s> : use blastn db file <s> instead of default
-x <s> : blastx dbs are in dir <s>, instead of default
--mkey <s> : .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr'
--mdir <s> : model files are in directory <s>, not in $VADRMODELDIR
--mlist <s> : only use models listed in file <s>
options for controlling output feature table:
--nomisc : in feature table for failed seqs, never change feature type to misc_feature
--notrim : in feature table, don't trim coords due to Ns (for any feature types)
--noftrtrim <s> : in feature table, don't trim coords due to Ns for feature types in comma-delmited <s>
--noprotid : in feature table, don't add protein_id for CDS and mat_peptides
--forceprotid : in feature table, force protein_id value to be sequence name, then idx
options for controlling thresholds related to alerts:
--lowsc <x> : lowscore/LOW_SCORE bits per nucleotide threshold is <x> [0.3]
--indefclass <x> : indfcls/INDEFINITE_CLASSIFICATION bits per nucleotide diff threshold is <x> [0.03]
--incspec <x> : inc{group,subgrp}/INCORRECT_{GROUP,SUBGROUP} bits/nt threshold is <x> [0.2]
--lowcov <x> : lowcovrg/LOW_COVERAGE fractional coverage threshold is <x> [0.9]
--dupregolp <n> : dupregin/DUPLICATE_REGIONS minimum model overlap is <n> [20]
--dupregsc <x> : dupregin/DUPLICATE_REGIONS minimum bit score is <x> [10]
--indefstr <x> : indfstrn/INDEFINITE_STRAND minimum weaker strand bit score is <x> [25]
--lowsim5term <n> : lowsim5{s,f}/LOW_{FEATURE_}SIMILARITY_START minimum length is <n> [15]
--lowsim3term <n> : lowsim3{s,f}/LOW_{FEATURE_}SIMILARITY_END minimum length is <n> [15]
--lowsimint <n> : lowsimi{s,f}/LOW_{FEATURE_}SIMILARITY (internal) minimum length is <n> [1]
--biasfract <x> : biasdseq/BIASED_SEQUENCE fractional threshold is <x> [0.25]
--indefann <x> : indf{5,3}loc/'INDEFINITE_ANNOTATION_{START,END} non-mat_peptide min allowed post probability is <x> [0.8]
--indefann_mp <x> : indf{5,3}loc/'INDEFINITE_ANNOTATION_{START,END} mat_peptide min allowed post probability is <x> [0.6]
--fstminnt <n> : fst{hi,lo}cnf/POSSIBLE_FRAMESHIFT_{HIGH,LOW}_CONF max allowed frame disagreement nt length w/o alert is <n> [6]
--fsthighthr <x> : fsthicnf/POSSIBLE_FRAMESHIFT_HIGH_CONF minimum average probability for alert is <x> [0.8]
--fstlowthr <x> : fstlocnf/POSSIBLE_FRAMESHIFT_LOW_CONF minimum average probability for alert is <x> [0.3]
--xalntol <n> : indf{5,3}{st,lg}/INDEFINITE_ANNOTATION_{START,END} max allowed nt diff blastx start/end is <n> [5]
--xmaxins <n> : insertnp/INSERTION_OF_NT max allowed nucleotide insertion length in blastx validation is <n> [27]
--xmaxdel <n> : deletinp/DELETION_OF_NT max allowed nucleotide deletion length in blastx validation is <n> [27]
--nmaxins <n> : insertnn/INSERTION_OF_NT max allowed nucleotide (nt) insertion length in CDS nt alignment is <n> [27]
--nmaxdel <n> : deletinn/DELETION_OF_NT max allowed nucleotide (nt) deletion length in CDS nt alignment is <n> [27]
--xlonescore <n> : indfantp/INDEFINITE_ANNOTATION min score for a blastx hit not supported by CM analysis is <n> [80]
--hlonescore <n> : indfantp/INDEFINITE_ANNOTATION min score for a hmmer hit not supported by CM analysis is <n> [10]
options for controlling cmalign alignment stage:
--mxsize <n> : set max allowed memory for cmalign to <n> Mb [16000]
--tau <x> : set the initial tau value for cmalign to <x> [0.001]
--nofixedtau : do not fix the tau value when running cmalign, allow it to decrease if nec
--nosub : use alternative alignment strategy for truncated sequences
--noglocal : do not run cmalign in glocal mode (run in local mode)
options for controlling blastx protein validation stage:
--xmatrix <s> : use the matrix <s> with blastx (e.g. BLOSUM45)
--xdrop <n> : set the xdrop value for blastx to <n> [25]
--xnumali <n> : number of alignments to keep in blastx output and consider if --xlongest is <n> [20]
--xlongest : keep the longest blastx hit, not the highest scoring one
options for using hmmer instead of blastx for protein validation:
--hmmer : use hmmer for protein validation, not blastx
--h_max : use --max option with hmmsearch
--h_minbit <x> : set minimum hmmsearch bit score threshold to <x> [-10]
options related to blastn-derived seeded alignment acceleration:
-s : use the max length ungapped region from blastn to seed the alignment
--s_blastnws <n> : for -s, set blastn -word_size <n> to <n> [7]
--s_blastnsc <x> : for -s, set blastn minimum HSP score to consider to <x> [50]
--s_overhang <n> : for -s, set length of nt overhang for subseqs to align to <n> [100]
options related to replacing Ns with expected nucleotides:
-r : replace stretches of Ns with expected nts, where possible
--r_minlen <n> : minimum length subsequence to replace Ns in is <n> [5]
--r_minfract <x> : minimum fraction of Ns in subseq to trigger replacement is <x> [0.5]
--r_fetchr : fetch features for output fastas from seqs w/Ns replaced, not originals
--r_cdsmpr : detect CDS and MP alerts in sequences w/Ns replaced, not originals
--r_pvorig : use original sequences for protein validation, not replaced seqs
--r_prof : use slower profile methods, not blastn, to identify Ns to replace
options related to parallelization on compute farm:
-p : parallelize cmsearch/cmalign on a compute farm
-q <s> : use qsub info file <s> instead of default
--nkb <n> : number of KB of sequence for each farm job is <n> [10]
--wait <n> : allow <n> wall-clock minutes for jobs on farm to finish, including queueing time [500]
--errcheck : consider any farm stderr output as indicating a job failure
--maxnjobs <n> : set max number of jobs to submit to compute farm to <n> [2500]
options for skipping stages:
--skip_align : skip the cmalign step, use results from an earlier run of the script
--skip_pv : do not perform blastx-based protein validation
optional output files:
--out_stk : output per-model full length stockholm alignments (.stk)
--out_afa : output per-model full length fasta alignments (.afa)
--out_rpstk : with -r, output stockholm alignments of seqs with Ns replaced
--out_rpafa : with -r, output fasta alignments of seqs with Ns replaced
--out_nofs : do not output frameshift stockholm alignment files
--out_nofasta : do not output fasta files of features, or passing/failing seqs
--out_debug : dump voluminous info from various data structures to output files
other expert options:
--execname <s> : define executable name of this script as <s>
--alicheck : for debugging, check aligned sequence vs input sequence for identity
--noseqnamemax : do not enforce a maximum length of 50 for sequence names (GenBank max)
--minbit <x> : set minimum cmsearch bit score threshold to <x> [-10]
--origfa : do not copy fasta file prior to analysis, use original
--msub <s> : read model substitution file from <s>
--xsub <s> : read blastx db substitution file from <s>
End the interactive session:
[user@cn4471 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$