TALON is a Python package for identifying and quantifying known and novel genes/isoforms in long-read transcriptome data sets. TALON is technology-agnostic in that it works from mapped SAM files, allowing data from different sequencing platforms (e.g. PacBio and Oxford Nanopore) to be analyzed side by side.
Example files are in $TALON_TEST_DATA.
Allocate an interactive session and run through the steps using 2 replicates of human cardiac atrium tissue run on a PacBio Sequel II:
[user@biowulf]$ sinteractive --cpus-per-task=6 --mem=16G --gres=lscratch:50
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load talon
[user@cn3144]$ cp -L ${TALON_TEST_DATA:-none}/* .
[user@cn3144]$ ls -lh
total 4.6G
-rw-r--r-- 1 user group 172M Apr 20 14:31 ENCFF291EKY.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF291EKY.bam.bai
-rw-r--r-- 1 user group 189M Apr 20 14:31 ENCFF613SDS.bam
-rw-r--r-- 1 user group 1.7M Apr 20 14:31 ENCFF613SDS.bam.bai
-rw-r--r-- 1 user group 1.3G Apr 20 14:31 gencode.v35.primary_assembly.annotation.gtf
-rw-r--r-- 1 user group 3.0G Apr 20 14:31 GRCh38.primary_assembly.genome.fa
-rw-r--r-- 1 user group 6.4K Apr 20 14:31 GRCh38.primary_assembly.genome.fa.fai
[user@cn3144]$ gtf=gencode.v35.primary_assembly.annotation.gtf
[user@cn3144]$ genome=GRCh38.primary_assembly.genome.fa
[user@cn3144]$ bam1=ENCFF291EKY.bam
[user@cn3144]$ bam2=ENCFF613SDS.bam
[user@cn3144]$ talon_initialize_database \
--f $gtf \
--a gencode_35 \
--g GRCh38 \
--o example_talon
chr1
bulk update genes...
bulk update gene_annotations...
bulk update transcripts...
[...snip...]
[user@cn3144]$ mkdir -p labeled tmp
[user@cn3144]$ ### check internal priming sites
[user@cn3144]$ talon_label_reads --f $bam1 \
--g $genome \
--t $SLURM_CPUS_PER_TASK \
--ar 20 \
--tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
--deleteTmp \
--o labeled/${bam1%.bam}
[ 2021-04-20 17:10:44 ] Started talon_label_reads run.
[ 2021-04-20 17:10:44 ] Splitting SAM by chromosome...
[ 2021-04-20 17:10:44 ] -----Writing chrom files...
[ 2021-04-20 17:10:59 ] Launching parallel jobs...
[ 2021-04-20 17:11:14 ] Pooling output files...
[ 2021-04-20 17:11:27 ] Run complete
[user@cn3144]$ talon_label_reads --f $bam2 \
--g $genome \
--t $SLURM_CPUS_PER_TASK \
--ar 20 \
--tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
--deleteTmp \
--o labeled/${bam2%.bam}
[...snip...]
[user@cn3144]$ ### run talon annotator
[user@cn3144]$ cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__
[user@cn3144]$ talon \
-t $SLURM_CPUS_PER_TASK \
--f config.csv \
--db example_talon.db \
--build GRCh38 \
--o example
[user@cn3144]$ ### summarize results
[user@cn3144]$ talon_summarize \
--db example_talon.db \
--v \
--o example
[user@cn3144]$ ### run any other tools and then copy results back to shared space
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
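Each line of the config.csv written in the session above has four comma-separated fields, the last being the labeled SAM file produced by talon_label_reads. A mistyped path may only surface once the annotator starts, so a quick check beforehand can help. The following is a minimal sketch; check_talon_config is a hypothetical helper written for illustration and is not part of TALON.

```shell
# Sketch only: check_talon_config is a hypothetical helper, not part of TALON.
# It checks that every line has exactly four comma-separated fields and that
# the file named in the fourth field exists and is not empty.
check_talon_config() {
    cfg_status=0
    cfg_lineno=0
    # The "|| [ -n ... ]" clause also handles a last line without a newline.
    while IFS=, read -r cfg_name cfg_desc cfg_platform cfg_sam cfg_extra \
            || [ -n "$cfg_name" ]; do
        cfg_lineno=$((cfg_lineno + 1))
        if [ -z "$cfg_name" ] || [ -z "$cfg_desc" ] || [ -z "$cfg_platform" ] \
                || [ -z "$cfg_sam" ] || [ -n "$cfg_extra" ]; then
            echo "line $cfg_lineno: expected 4 comma-separated fields" >&2
            cfg_status=1
        elif [ ! -s "$cfg_sam" ]; then
            echo "line $cfg_lineno: '$cfg_sam' is missing or empty" >&2
            cfg_status=1
        fi
    done < "$1"
    return "$cfg_status"
}
```

It could be called as `check_talon_config config.csv` right before the talon step, so that the annotator only runs when the function returns success.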
Create a batch script (e.g. talon.sh). For example:
#! /bin/bash
module load talon/5.0
bam1=ENCFF291EKY.bam
bam2=ENCFF613SDS.bam
gtf=gencode.v35.primary_assembly.annotation.gtf
genome=GRCh38.primary_assembly.genome.fa
cd /lscratch/$SLURM_JOB_ID
cp -L ${TALON_TEST_DATA:-none}/* .
talon_initialize_database \
--f $gtf \
--a gencode_35 \
--g GRCh38 \
--o example_talon
mkdir -p labeled tmp
talon_label_reads --f $bam1 \
--g $genome \
--t $SLURM_CPUS_PER_TASK \
--ar 20 \
--tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
--deleteTmp \
--o labeled/${bam1%.bam}
talon_label_reads --f $bam2 \
--g $genome \
--t $SLURM_CPUS_PER_TASK \
--ar 20 \
--tmpDir=/lscratch/$SLURM_JOB_ID/tmp \
--deleteTmp \
--o labeled/${bam2%.bam}
cat > config.csv <<__EOF__
ex_rep1,GRCh38,PacBio-Sequel2,labeled/${bam1%.bam}_labeled.sam
ex_rep2,GRCh38,PacBio-Sequel2,labeled/${bam2%.bam}_labeled.sam
__EOF__
talon \
-t $SLURM_CPUS_PER_TASK \
--f config.csv \
--db example_talon.db \
--build GRCh38 \
--o example
talon_summarize \
--db example_talon.db \
--v \
--o example
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=6 --mem=16G --gres=lscratch:50 talon.sh
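The batch script above relies on two shell parameter expansions that are easy to misread. The short sketch below shows what each one evaluates to, using the file name from the example; the variable names prefix, labeled_sam and src are introduced here only for illustration.

```shell
bam1=ENCFF291EKY.bam

# ${bam1%.bam} strips the trailing ".bam", giving the output prefix passed to --o.
prefix="labeled/${bam1%.bam}"
# File name listed for this replicate in config.csv.
labeled_sam="${prefix}_labeled.sam"
echo "$prefix"        # prints labeled/ENCFF291EKY
echo "$labeled_sam"   # prints labeled/ENCFF291EKY_labeled.sam

# ${VAR:-none} substitutes the word "none" when VAR is unset or empty, so the
# cp command fails on the nonexistent path "none/*" instead of expanding to "/*".
src="${TALON_TEST_DATA:-none}"
echo "$src"
```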
Create a swarmfile (e.g. talon.swarm). For example:
talon_label_reads --f ENCFF291EKY.bam \
--g GRCh38.primary_assembly.genome.fa \
--t $SLURM_CPUS_PER_TASK \
--ar 20 \
--tmpDir=/lscratch/$SLURM_JOB_ID \
--deleteTmp \
--o labeled/ENCFF291EKY
talon_label_reads --f ENCFF613SDS.bam \
--g GRCh38.primary_assembly.genome.fa \
--t $SLURM_CPUS_PER_TASK \
--ar 20 \
--tmpDir=/lscratch/$SLURM_JOB_ID \
--deleteTmp \
--o labeled/ENCFF613SDS
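With more than a handful of BAM files, the swarmfile can be generated instead of typed by hand. The following is a minimal sketch that writes one talon_label_reads command per line, with the flags copied from the example above; write_label_swarm is a hypothetical helper and not part of TALON or swarm.

```shell
# Sketch only: write_label_swarm is a hypothetical helper, not part of TALON or swarm.
# Usage: write_label_swarm GENOME_FASTA SWARMFILE BAM [BAM...]
write_label_swarm() {
    swarm_genome=$1
    swarm_out=$2
    shift 2
    : > "$swarm_out"    # start with an empty swarmfile
    for swarm_bam in "$@"; do
        swarm_prefix=$(basename "$swarm_bam" .bam)
        {
            printf 'talon_label_reads --f %s --g %s ' "$swarm_bam" "$swarm_genome"
            # Single quotes keep the Slurm variables literal so that each
            # subjob expands them at run time.
            printf '%s ' '--t $SLURM_CPUS_PER_TASK' '--ar 20' \
                '--tmpDir=/lscratch/$SLURM_JOB_ID' '--deleteTmp'
            printf '%s\n' "--o labeled/$swarm_prefix"
        } >> "$swarm_out"
    done
}
```

For the two replicates used here, this could be called as `write_label_swarm GRCh38.primary_assembly.genome.fa talon.swarm ENCFF291EKY.bam ENCFF613SDS.bam`. As in the interactive session, the labeled directory should be created with `mkdir -p labeled` before the swarm is submitted.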
Submit this job using the swarm command.
swarm -f talon.swarm [-g 10] [-t 6] --gres=lscratch:50 --module talon/5.0
where
| -g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
| --module talon | Loads the talon module for each subjob in the swarm |