TALON is a Python package for identifying and quantifying known and novel genes and isoforms in long-read transcriptome data sets. It is technology-agnostic: it works from mapped SAM files, so data from different sequencing platforms (e.g. PacBio and Oxford Nanopore) can be analyzed side by side.

References:

Dana Wyman et al., A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. bioRxiv

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run through the steps using two replicates of human cardiac atrium tissue sequenced on a PacBio Sequel II:

[user@biowulf]$ sinteractive --cpus-per-task=6 --mem=16G --gres=lscratch:50
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load talon
[user@cn3144]$ cp -L ${TALON_TEST_DATA:-none}/* .
[user@cn3144]$ ls -lh
total 4.6G
-rw-r--r-- 1 user group 172M Apr 20 14:31 ENCFF291EKY.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF291EKY.bam.bai
-rw-r--r-- 1 user group 189M Apr 20 14:31 ENCFF613SDS.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF613SDS.bam.bai
-rw-r--r-- 1 user group 1.3G Apr 20 14:31 gencode.v35.primary_assembly.annotation.gtf
-rw-r--r-- 1 user group 3.0G Apr 20 14:31 GRCh38.primary_assembly.genome.fa
-rw-r--r-- 1 user group 6.4K Apr 20 14:31 GRCh38.primary_assembly.genome.fa.fai
[user@cn3144]$ gtf=gencode.v35.primary_assembly.annotation.gtf
[user@cn3144]$ genome=GRCh38.primary_assembly.genome.fa
[user@cn3144]$ bam1=ENCFF291EKY.bam
[user@cn3144]$ bam2=ENCFF613SDS.bam
[user@cn3144]$ talon_initialize_database \
    --f $gtf \
    --a gencode_35 \
    --g GRCh38 \
    --o example_talon
chr1
bulk update genes...
bulk update gene_annotations...
bulk update transcripts...
[...snip...]
[user@cn3144]$ mkdir -p labeled tmp

[user@cn3144]$ ### check internal priming sites
[user@cn3144]$ talon_label_reads --f $bam1 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam1%.bam}
[ 2021-04-20 17:10:44 ] Started talon_label_reads run.
[ 2021-04-20 17:10:44 ] Splitting SAM by chromosome...
[ 2021-04-20 17:10:44 ] -----Writing chrom files...
[ 2021-04-20 17:10:59 ] Launching parallel jobs...
[ 2021-04-20 17:11:14 ] Pooling output files...
[ 2021-04-20 17:11:27 ] Run complete
[user@cn3144]$ talon_label_reads --f $bam2 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam2%.bam}
[...snip...]

[user@cn3144]$ ### run talon annotator
[user@cn3144]$ cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__
[user@cn3144]$ talon \
    -t $SLURM_CPUS_PER_TASK \
    --f config.csv \
    --db example_talon.db \
    --build GRCh38 \
    --o example

[user@cn3144]$ ### summarize results
[user@cn3144]$ talon_summarize \
    --db example_talon.db \
    --v \
    --o example

[user@cn3144]$ ### run any other tools and then copy results back to shared space
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
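
With more than two replicates, the config.csv lines can be generated in a loop rather than typed by hand. The following is a minimal sketch, assuming the same four comma-separated fields used in the session above and the `<name>_labeled.sam` file naming that appears there. The function name `make_talon_config` and the `ex_repN` dataset names are illustrative only and are not part of TALON.

```shell
#!/bin/bash
# Hypothetical helper (not part of TALON): print one config line per BAM file.
# Arguments: genome build, platform label, then any number of BAM file names.
# The body runs in a subshell ( ... ) so its variables do not leak out.
make_talon_config() (
    build=$1
    platform=$2
    shift 2
    n=1
    for bam in "$@"; do
        # Drop any directory and the .bam suffix, similar to ${bam1%.bam} above.
        base=$(basename "$bam" .bam)
        printf 'ex_rep%d,%s,%s,labeled/%s_labeled.sam\n' \
            "$n" "$build" "$platform" "$base"
        n=$((n + 1))
    done
)

# Recreate the two-replicate config from the interactive session.
make_talon_config GRCh38 PacBio-Sequel2 ENCFF291EKY.bam ENCFF613SDS.bam > config.csv
cat config.csv
```

For the two test BAM files this should produce the same two lines as the here-document in the session; additional BAM names can be appended to the argument list.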

Create a batch script (e.g. talon.sh). For example:

#! /bin/bash

module load talon/5.0
bam1=ENCFF291EKY.bam
bam2=ENCFF613SDS.bam
gtf=gencode.v35.primary_assembly.annotation.gtf
genome=GRCh38.primary_assembly.genome.fa

cd /lscratch/$SLURM_JOB_ID
cp -L ${TALON_TEST_DATA:-none}/* .

talon_initialize_database \
    --f $gtf \
    --a gencode_35 \
    --g GRCh38 \
    --o example_talon
mkdir -p labeled tmp

talon_label_reads --f $bam1 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam1%.bam}
talon_label_reads --f $bam2 \
    --g $genome \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
    --deleteTmp \
    --o labeled/${bam2%.bam}

cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__

talon \
    -t $SLURM_CPUS_PER_TASK \
    --f config.csv \
    --db example_talon.db \
    --build GRCh38 \
    --o example

talon_summarize \
    --db example_talon.db \
    --v \
    --o example
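
The batch script is submitted with the Slurm sbatch command. The following is a minimal sketch, assuming the script was saved as talon.sh and reusing the resource requests from the interactive session (6 CPUs, 16 GB of memory, 50 GB of lscratch). Those values suit the test data and will likely need adjusting for other inputs. The `run` wrapper and the `SUBMIT` switch are conventions of this sketch, not Slurm features.

```shell
#!/bin/bash
# Resource requests copied from the interactive example; adjust as needed.
cpus=6
mem=16g
lscratch=50
script=talon.sh

# Hypothetical dry-run wrapper: print the command unless SUBMIT=1 is set,
# in which case the command is executed.
run() {
    if [ "${SUBMIT:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

run sbatch --cpus-per-task="$cpus" --mem="$mem" --gres="lscratch:$lscratch" "$script"
```

Run as-is, the sketch only prints the sbatch command for review; setting `SUBMIT=1` in the environment submits the job.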

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources. For example, a swarm command file with one talon_label_reads command per BAM file:

talon_label_reads --f ENCFF291EKY.bam \
    --g GRCh38.primary_assembly.genome.fa \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID \
    --deleteTmp \
    --o labeled/ENCFF291EKY
talon_label_reads --f ENCFF613SDS.bam \
    --g GRCh38.primary_assembly.genome.fa \
    --t $SLURM_CPUS_PER_TASK \
    --ar 20 \
    --tmpDir=/lscratch/$SLURM_JOB_ID \
    --deleteTmp \
    --o labeled/ENCFF613SDS

The swarm command accepts the following options:

-g #	Number of gigabytes of memory required for each process (1 line in the swarm command file).
-t #	Number of threads/CPUs required for each process (1 line in the swarm command file).
--module talon	Loads the talon module for each subjob in the swarm.
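
For a larger set of BAM files, the swarm command file can be written by a short loop and paired with a submission command built from the options above. The following is a minimal sketch under a few assumptions: the file name talon.swarm is hypothetical, the `-g 16 -t 6` values and the `--gres=lscratch:50` request simply mirror the interactive session, and each command is written on a single line instead of using backslash continuations.

```shell
#!/bin/bash
# Hypothetical swarm file name for this sketch.
swarmfile=talon.swarm
genome=GRCh38.primary_assembly.genome.fa

# Start from an empty file, then append one command line per BAM file.
: > "$swarmfile"
for bam in ENCFF291EKY.bam ENCFF613SDS.bam; do
    # The single-quoted format string keeps $SLURM_CPUS_PER_TASK and
    # $SLURM_JOB_ID literal, so they are expanded inside each subjob
    # rather than when this script runs.
    printf 'talon_label_reads --f %s --g %s --t $SLURM_CPUS_PER_TASK --ar 20 --tmpDir=/lscratch/$SLURM_JOB_ID --deleteTmp --o labeled/%s\n' \
        "$bam" "$genome" "${bam%.bam}" >> "$swarmfile"
done

cat "$swarmfile"

# Print (rather than run) a matching submission command for review.
echo "swarm -f $swarmfile -g 16 -t 6 --gres=lscratch:50 --module talon"
```

As in the interactive example, the labeled output directory presumably needs to exist before the subjobs start, so create it with `mkdir -p labeled` in the submission directory first.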