A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.
Features
medusa --help
Allocate an interactive session and run the program.
Sample session (user input follows the shell prompt):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load medusa
[user@cn3144 ~]$ cp -r $MEDUSA_TEST_DATA/* .
[user@cn3144 ~]$ medusa --help
medusa --help
Medusa version 1.6
usage: java -jar medusa.jar -i inputfile -v
available options:
-d OPTIONAL PARAMETER;The option *-d*
allows for the estimation of the
distance between pairs of contigs
based on the reference genome(s):
in this case the scaffolded contigs
will be separated by a number of N
characters equal to this estimate.
The estimated distances are also
saved in the
_distanceTable file.
By default the scaffolded contigs
are separated by 100 Ns
-f <> OPTIONAL PARAMETER; The option *-f*
is optional and indicates the path
to the comparison drafts folder
-gexf OPTIONAL PARAMETER;Conting network
and path cover are given in gexf
format.
-h Print this help and exist.
-i <> REQUIRED PARAMETER;The option *-i*
indicates the name of the target
genome file.
-n50 <> OPTIONAL PARAMETER; The option
*-n50* allows the calculation of
the N50 statistic on a FASTA file.
In this case the usage is the
following: java -jar medusa.jar
-n50 . All the
other options will be ignored.
-o <> OPTIONAL PARAMETER; The option *-o*
indicates the name of output fasta
file.
-random <> OPTIONAL PARAMETER;The option
*-random* is available (not
required). This option allows the
user to run a given number of
cleaning rounds and keep the best
solution. Since the variability is
small 5 rounds are usually
sufficient to find the best score.
-scriptPath <> OPTIONAL PARAMETER; The folder
containing the medusa scripts.
Default value: medusa_scripts
-v RECOMMENDED PARAMETER; The option
*-v* (recommended) print on console
the information given by the
package MUMmer. This option is
strongly suggested to understand if
MUMmer is not running properly.
-w2 OPTIONAL PARAMETER;The option *-w2*
is optional and allows for a
sequence similarity based weighting
scheme. Using a different weighting
scheme may lead to better results.
[user@cn3144 ~]$ medusa -f reference_genomes/ -i Rhodobacter_target.fna -v
INPUT FILE:Rhodobacter_target.fna
------------------------------------------------------------------------------------------------------------------------
Running MUMmer...done.
------------------------------------------------------------------------------------------------------------------------
Building the network...done.
------------------------------------------------------------------------------------------------------------------------
Cleaning the network...done.
------------------------------------------------------------------------------------------------------------------------
Scaffolds File saved: Rhodobacter_target.fnaScaffold.fasta
------------------------------------------------------------------------------------------------------------------------
Number of scaffolds: 78 (singletons = 32, multi-contig scaffold = 46)
from 564 initial fragments.
Total length of the jointed fragments: 4224838
Computing N50 on 78 sequences.
N50: 143991.0
----------------------
Summary File saved: Rhodobacter_target.fna_SUMMARY
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
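The run above reports an N50 for the scaffold file. As a rough cross-check, the statistic can be sketched in shell: sort the sequence lengths in descending order and report the length at which the running sum first reaches half of the total. The function name `n50` is hypothetical, and this is only a sketch of the common definition; Medusa's own `-n50` calculation may differ in details such as tie handling.

```shell
#!/bin/bash
# n50 FILE: print the N50 of the sequences in a FASTA file.
# Hypothetical helper, not part of Medusa.
n50() {
    # First awk: emit one length per sequence, summing wrapped lines.
    awk '/^>/ { if (len) print len; len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second awk: walk lengths from longest to shortest until the
    # running sum reaches half of the total length.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}
```

For example, `n50 Rhodobacter_target.fnaScaffold.fasta` prints a single number for the scaffold file written by the session above.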
Create a batch input file (e.g. medusa.sh). For example:
#!/bin/bash
set -e
module load medusa
medusa -f reference_genomes/ -i Rhodobacter_target.fna -v
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=2g medusa.sh
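As an alternative to passing resources on the command line, Slurm also reads `#SBATCH` directives placed at the top of the batch script. The sketch below writes such a script with the same requests as the command above; the function name `write_medusa_batch` is hypothetical and used only for illustration.

```shell
#!/bin/bash
# write_medusa_batch FILE: write a batch script whose resource requests
# are carried by #SBATCH directives instead of sbatch command-line options.
# Hypothetical helper, not part of Medusa or Slurm.
write_medusa_batch() {
    local out="$1"
    # Quoted heredoc delimiter: the script body is written verbatim.
    cat > "$out" <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=2g
set -e
module load medusa
medusa -f reference_genomes/ -i Rhodobacter_target.fna -v
EOF
}
```

After `write_medusa_batch medusa.sh`, the job can be submitted with a plain `sbatch medusa.sh`.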
Create a swarmfile (e.g. medusa.swarm). For example:
cd dir1;medusa -f reference_genomes/ -i 1_target.fna -v
cd dir2;medusa -f reference_genomes/ -i 2_target.fna -v
cd dir3;medusa -f reference_genomes/ -i 3_target.fna -v
cd dir4;medusa -f reference_genomes/ -i 4_target.fna -v
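With many targets, a swarmfile like the one above can be generated rather than typed by hand. The sketch below assumes the same layout as the example (a directory `dirN` holding `N_target.fna` and a `reference_genomes/` folder); the function name `make_medusa_swarm` is hypothetical.

```shell
#!/bin/bash
# make_medusa_swarm ID...: print one swarm command line per sample ID.
# Hypothetical helper; assumes each dir<ID> contains <ID>_target.fna
# and a reference_genomes/ folder, as in the example swarmfile.
make_medusa_swarm() {
    local id
    for id in "$@"; do
        printf 'cd dir%s;medusa -f reference_genomes/ -i %s_target.fna -v\n' "$id" "$id"
    done
}
```

For example, `make_medusa_swarm 1 2 3 4 > medusa.swarm` writes the four lines of the example swarmfile.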
Submit this job using the swarm command.
swarm -f medusa.swarm [-g #] [-t #] --module medusa
where
| -g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
| -t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
| --module medusa | Loads the medusa module for each subjob in the swarm |