OncodriveCLUSTL is a sequence-based clustering algorithm to detect significant clustering signals across genomic regions. It is based on a local background model derived from the simulation of mutations, accounting for the composition of tri- or penta-nucleotide context substitutions observed in the cohort under study.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=75g -c20 --gres=lscratch:20
[user@cn3335 ~]$ module load oncodriveCLUSTL
[+] Loading singularity 3.10.5 on cn4338
[+] Loading oncodriveCLUSTL 1.1.1
[user@cn3335 ~]$ oncodriveclustl -h
Usage: oncodriveclustl [OPTIONS]
OncodriveCLUSTL is a sequence based clustering method to identify cancer
drivers across the genome
Args: input_file (str): path to mutations file regions_file (str):
path to input genomic coordinates file output_directory(str): path to
output directory. Output files will be generated in it.
input_signature (str): path to file containing input context based
mutational probabilities. By default (when no input signatures),
OncodriveCLUSTL will calculate them from the mutations input file.
elements_file (str): path to file containing one element per row
(optional) to analyzed the listed elements. By default,
OncodriveCLUSTL analyzes all genomic elements contained in `regions_file`.
elements (str): genomic element symbol (optional). The analysis will be
performed only on the specified GEs. genome (str): genome to use:
'hg38', 'hg19', 'mm10', 'c3h', 'car', 'cast' and 'f344'
element_mutations (int): minimum number of mutations per genomic element
to undertake analysis cluster_mutations (int): minimum number of
mutations to define a cluster smooth_window (int): Tukey kernel
smoothing window length cluster_window (int): clustering window length
kmer (int): context nucleotides to calculate the mutational probabilities
(trinucleotides or pentanucleotides) n_simulations (int): number of
simulations simulation_mode (str): simulation mode
simulation_window (int): window length to simulate mutations
signature_calculation (str): signature calculation, mutation frequencies
(default) or mutation counts normalized by k-mer region counts
signature_group (str): header of the column to group signatures. One
signature will be computed for each group cores (int): number of CPUs
to use seed (int): seed log_level (str): verbosity of the logger
concatenate (bool): flag to calculate clustering on collapsed genomic
regions (e.g., coding regions in a gene) clustplot (bool): flag to
generate a needle plot with clusters for an element qqplot (bool):
flat to generate a quantile-quantile (QQ) plot for a dataset gzip
(bool): flag to generate GZIP compressed output files
Returns: None
Options:
-i, --input-file PATH File containing somatic mutations
[required]
-r, --regions-file PATH File with the genomic regions to analyze
[required]
-o, --output-directory TEXT Output directory to be created [required]
-sig, --input-signature PATH File containing input context based
mutational probabilities (signature)
-ef, --elements-file PATH File with the symbols of the elements to
analyze
-e, --elements TEXT Symbol of the element(s) to analyze
-g, --genome [hg38|hg19|mm10|c3h|car|cast|f344]
Genome to use
-emut, --element-mutations INTEGER
Cutoff of element mutations. Default is 2
-cmut, --cluster-mutations INTEGER
Cutoff of cluster mutations. Default is 2
-sw, --smooth-window INTEGER RANGE
Smoothing window. Default is 11
-cw, --cluster-window INTEGER RANGE
Cluster window. Default is 11
-kmer, --kmer [3|5] K-mer nucleotide context
-n, --n-simulations INTEGER number of simulations. Default is 1000
-sim, --simulation-mode [mutation_centered|region_restricted]
Simulation mode
-simw, --simulation-window INTEGER RANGE
Simulation window. Default is 31
-sigcalc, --signature-calculation [frequencies|region_normalized]
Signature calculation: mutation frequencies
(default) or k-mer mutation counts
normalized by k-mer region counts
-siggroup, --signature-group [SIGNATURE|SAMPLE|CANCER_TYPE]
Header of the column to group signatures
calculation
-c, --cores INTEGER RANGE Number of cores to use in the computation.
By default it will use all the available
cores.
--seed INTEGER Seed to use in the simulations
--log-level [debug|info|warning|error|critical]
Verbosity of the logger
--concatenate Calculate clustering on concatenated genomic
regions (e.g., exons in coding sequences)
--clustplot Generate a needle plot with clusters for an
element
--qqplot Generate a quantile-quantile (QQ) plot for a
dataset
--gzip Gzip compress files
-h, --help Show this message and exit.
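The -ef/--elements-file option listed above takes a file with one element symbol per row. A minimal sketch of preparing such a file follows. The file name my_elements.txt, the output directory test_subset, and the symbols KRAS and TP53 are illustrative assumptions (they may or may not be present in your regions file). The run itself is guarded so that the snippet does nothing harmful when the program or the sample data is not available.

```shell
# Hypothetical element list: one symbol per row, as the -ef help text describes.
printf '%s\n' KRAS TP53 > my_elements.txt

# Run only if the program and the sample mutation file are available.
if command -v oncodriveclustl >/dev/null 2>&1 && [ -f PAAD.tsv.gz ]; then
    oncodriveclustl -i PAAD.tsv.gz -r cds.hg19.regions.gz \
        -o test_subset -ef my_elements.txt
else
    echo "Skipping run: oncodriveclustl or the sample data is not available here"
fi
```

Restricting the analysis to a few elements is a quick way to check inputs before launching a full run.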
Copy sample data to the current folder:
[user@cn3335 ~]$ cp -r $ODCLUSTL_DATA/* .
Now let's run oncodriveCLUSTL on the sample data. According to the oncodriveCLUSTL documentation, "The first time that you run OncodriveCLUSTL with a given reference genome, it will download it from our servers. By default the downloaded datasets go to ~/.bgdata. If you want to move these datasets to another folder you have to define the system environment variable BGDATA_LOCAL with an export command."
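If the home directory is short on space, the download location can be redirected before the first run. The sketch below sets BGDATA_LOCAL, the variable named in the quoted documentation. The helper variable BGDATA_DIR and its default path are placeholders chosen for illustration; substitute any directory with enough room for the reference genome.

```shell
# Placeholder location (an assumption); replace it with a directory that has
# enough space for the downloaded reference datasets.
BGDATA_DIR="${BGDATA_DIR:-$PWD/bgdata}"
mkdir -p "$BGDATA_DIR"

# BGDATA_LOCAL is the environment variable named in the documentation.
export BGDATA_LOCAL="$BGDATA_DIR"
echo "Reference datasets will be stored in: $BGDATA_LOCAL"
```

The export must be in effect in the same shell session (or batch script) that later runs oncodriveclustl.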
[user@cn3335 ~]$ oncodriveclustl -i PAAD.tsv.gz -r cds.hg19.regions.gz -o test_output
2023-02-02 08:32:50,073 [110140] INFO root: OncodriveCLUSTL
2023-02-02 08:32:50,073 [110140] INFO root:
input_file: PAAD.tsv.gz
regions_file: cds.hg19.regions.gz
input_signature: None
output_directory: test_output
genome: hg19
element_mutations: 2
cluster_mutations: 2
concatenate: False
smooth_window: 11
cluster_window: 11
k-mer: 3
simulation_mode: mutation_centered
simulation_window: 31
n_simulations: 1000
signature_calculation: frequencies
signature_group: None
cores: 128
gzip: False
seed: None
2023-02-02 08:32:50,075 [110140] INFO root: Initializing OncodriveCLUSTL...
2023-02-02 08:32:50,077 [110140] WARNING root:
Running with default simulating, smoothing and clustering OncodriveCLUSTL parameters. Default parameters may not be optimal for your data.
Please, read Supplementary Methods to perform model selection for your data.
2023-02-02 08:32:50,079 [110140] WARNING root:
Signatures will be calculated as mutation frequencies: # mutated ref>alt k-mer counts / # total substitutions
Please, read Supplementary Methods to perform a more accurate signatures calculation
2023-02-02 08:32:50,080 [110140] INFO root: Parsing genomic regions and mutations...
2023-02-02 08:33:01,448 [110140] INFO root: Regions parsed
2023-02-02 08:33:01,639 [110140] INFO root: Mutations parsed
2023-02-02 08:33:01,714 [110140] INFO root: Validated elements in genomic regions: 20169
2023-02-02 08:33:01,715 [110140] INFO root: Validated elements with mutations: 5183
2023-02-02 08:33:01,716 [110140] INFO root: Total substitution mutations: 7913
2023-02-02 08:33:01,717 [110140] INFO root: Computing signature...
2023-02-02 08:33:05,327 [110140] INFO root: Signature computed
2023-02-02 08:33:05,349 [110140] INFO root: Calculating results 1456 elements...
2023-02-02 08:33:05,352 [110140] INFO root: Iteration 1 of 15
simulations: 100%|█████████████████████████████████| 3/3 [00:14<00:00, 4.84s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:36<00:25, 1.04it/s]
2023-02-02 08:34:59,477 [110140] INFO root: Iteration 2 of 15
simulations: 100%|█████████████████████████████████| 7/7 [00:13<00:00, 1.95s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:30<00:24, 1.11it/s]
2023-02-02 08:36:45,989 [110140] INFO root: Iteration 3 of 15
simulations: 100%|█████████████████████████████████| 5/5 [00:22<00:00, 4.46s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:17<00:20, 1.30it/s]
2023-02-02 08:38:29,579 [110140] INFO root: Iteration 4 of 15
simulations: 100%|█████████████████████████████████| 5/5 [00:18<00:00, 3.70s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:34<00:25, 1.07it/s]
2023-02-02 08:40:25,999 [110140] INFO root: Iteration 5 of 15
simulations: 100%|█████████████████████████████████| 7/7 [00:30<00:00, 4.41s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:38<00:26, 1.02it/s]
2023-02-02 08:42:39,555 [110140] INFO root: Iteration 6 of 15
simulations: 100%|███████████████████████████████| 12/12 [00:30<00:00, 2.54s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:16<00:20, 1.32it/s]
2023-02-02 08:44:30,511 [110140] INFO root: Iteration 7 of 15
simulations: 100%|█████████████████████████████████| 7/7 [00:15<00:00, 2.23s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:25<00:22, 1.19it/s]
2023-02-02 08:46:15,403 [110140] INFO root: Iteration 8 of 15
simulations: 100%|███████████████████████████████| 11/11 [00:17<00:00, 1.59s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:29<00:23, 1.13it/s]
2023-02-02 08:48:07,030 [110140] INFO root: Iteration 9 of 15
simulations: 100%|███████████████████████████████| 13/13 [00:51<00:00, 3.95s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:34<00:25, 1.07it/s]
2023-02-02 08:50:36,900 [110140] INFO root: Iteration 10 of 15
simulations: 100%|█████████████████████████████████| 9/9 [00:13<00:00, 1.53s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:41<00:27, 1.00s/it]
2023-02-02 08:52:35,327 [110140] INFO root: Iteration 11 of 15
simulations: 100%|█████████████████████████████████| 5/5 [00:14<00:00, 2.96s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:28<00:23, 1.14it/s]
2023-02-02 08:54:22,365 [110140] INFO root: Iteration 12 of 15
simulations: 100%|███████████████████████████████| 10/10 [00:20<00:00, 2.09s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:37<00:25, 1.04it/s]
2023-02-02 08:56:24,457 [110140] INFO root: Iteration 13 of 15
simulations: 100%|█████████████████████████████████| 7/7 [00:38<00:00, 5.56s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:28<00:23, 1.14it/s]
2023-02-02 08:58:35,490 [110140] INFO root: Iteration 14 of 15
simulations: 100%|███████████████████████████████| 10/10 [00:18<00:00, 1.85s/it]
post processing: 79%|██████████████████████▉ | 101/128 [01:33<00:24, 1.09it/s]
2023-02-02 09:00:31,237 [110140] INFO root: Iteration 15 of 15
simulations: 100%|█████████████████████████████████| 1/1 [00:05<00:00, 5.99s/it]
post processing: 45%|█████████████▎ | 57/128 [00:56<01:10, 1.01it/s]
2023-02-02 09:01:40,325 [110140] INFO root: Elements results calculated
2023-02-02 09:01:40,381 [110140] INFO root: Clusters results calculated
2023-02-02 09:01:40,383 [110140] INFO root: Finished
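In the log above the run reports cores: 128 although the interactive session requested 20 CPUs (-c20). This is consistent with the help text, which says that all available cores are used by default. A minimal sketch of passing the allocation explicitly with -c follows. It assumes the session runs under Slurm with SLURM_CPUS_PER_TASK set. The fallback of 1 core, the seed value 123, and the output directory name test_output_c are illustrative choices.

```shell
# Use the number of CPUs allocated to the job; fall back to 1 if the Slurm
# variable is not set (assumption: SLURM_CPUS_PER_TASK reflects the -c request).
CORES="${SLURM_CPUS_PER_TASK:-1}"

if command -v oncodriveclustl >/dev/null 2>&1; then
    # -c limits the worker count; --seed fixes the seed used in the simulations.
    oncodriveclustl -i PAAD.tsv.gz -r cds.hg19.regions.gz \
        -o test_output_c -c "$CORES" --seed 123
else
    echo "oncodriveclustl not found; would run with -c $CORES"
fi
```

Matching -c to the allocation avoids starting more worker processes than the job has CPUs.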