IMPUTE5 on Biowulf

IMPUTE5 is a genotype imputation method that scales to reference panels with millions of samples. It builds on the observation, first made with IMPUTE2, that accuracy is best when each individual is imputed from a custom subset of reference haplotypes. IMPUTE5 achieves fast, accurate, and memory-efficient imputation by selecting those haplotypes with the Positional Burrows-Wheeler Transform (PBWT). Applying the PBWT data structure at genotyped markers, it identifies the locally best-matching haplotypes and long identical-by-state segments. The selected haplotypes are then used as conditioning states within the IMPUTE model.

IMPUTE version 2 (also known as IMPUTE2) is also available on Biowulf, along with qctool, gtool, and snptest.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf ~]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load impute5
[+] Loading impute5 1.2.0  ...

[user@cn3144 ~]$ cp -a /usr/local/apps/impute5/1.2.0/test/* .
[user@cn3144 ~]$ impute5  \
  --h reference_xcf.bcf \
  --g target.bcf \
  --o imputed.bcf \
  --r 20:1200000-3800000 \
  --m chr20.b37.gmap.gz \
  --buffer-region 20:1000000-4000000

[IMPUTE5] Imputation of phased SNP array data from large reference panels
  * Authors             : Simone Rubinacci & Jonathan Marchini, 2023
  * Contact             : rubinacci.simone@gmail.com
  * Version             : IMPUTE5 v1.2.0 / commit = 6e3d41e / release = 2023-08-21
  * Citation            : PLOS Genetics 16(11): e1009049 (2020). DOI: https://doi.org/10.1371/journal.pgen.1009049
  * Licence             : IMPUTE5 is freely available only for academic use. To see rules for non-academic use, please read the LICENCE.
  * Run date            : 21/05/2026 - 11:43:54

Files:
  * Input phased array  : [target.bcf]
  * Reference panel     : [reference_xcf.bcf]
  * Output file         : [imputed.bcf]
  * Output format       : [BCF format | ZLIB compression | with CSI index]

IMPUTE5 parameters:
  * Imputation model    : [Reference panel imputation]
  * Input region        : [20:1000000-4000000]
  * Output region       : [20:1200000-3800000]
  * Sparse MAF          : [0.03125]
  * Recombination rates : [Given by genetic map]
  * Ploidy              : [Only diploid samples in region]

Model parameters:
  * Ne [eff. pop. size] : [1000000]
  * Imputation err. rate: [0.0001]

Selection parameters:
  * K pbwt              : [1500]
  * PBWT max depth      : [10]
  * PBWT min depth      : [2]
  * PBWT modulo (cM)    : [0.02]

Test statistics parameters:
  * Surfbat test        : [NO]
  * Surfbat MAF         : [0.01]
  * Surfbat INFO        : [0.3]

Output
  * FORMAT/GT           : [YES]
  * FORMAT/DS           : [YES]
  * FORMAT/GP           : [YES]
  * FORMAT/AP           : [NO]
  * FORMAT/SAP          : [NO]
  * Buffer              : [NO]
  * CSI index           : [YES]

Other parameters
  * Seed                : [42]
  * #Threads            : [1]

Initialisation:
  * XCF scanning        [done]          [Li=21904 | Lg= 2064 | Lt=0]    (0.03s)
  * XCF parsing         [done]          [Li=21904 | Lg= 2064 | Lt=0]    (0.02s)
  * Common transpose    [ref]           [var2hap]                       (0.00s)
  * Rare transpose      [ref]           [var2hap]       [nrar=55.0534]  (0.00s)
  * Genetic map         [n=82962]                                       (0.03s)
  * cM interpolation    [s=3068 / i=18836]                              (0.00s)

Samples statistics
  * Reference panel     [Nrh=580]       [L=21904]       [Lb=264]
        Imputation reg. [Lg=1800]       [Lr=14348]      [Lc=7556]
  * Target panel        [Nth=20]        [L=2064]        [Lb=264]
        Imputation reg. [Lg=1800]       [Lt=0]

Region statistics
  * Buffer region       [20:1000000-4000000]
  * Imputation region   [20:1200000-3800000]
  * Region span         [2.99 Mb]       [6.65 cM]

PBWT state selection

WARNING: Using the full reference panel. This might result in excessive runtime and memory usage. Ignore this message if the size of your reference panel is less or equal than the default value of the Kpbwt parameter

  * PBWT stats          [No PBWT]       [done]                          (0.00s)
  * Geno transpose      [ref]           [var2hap]                       (0.00s)

Imputing variants
  * HMM imputation      [done]  [#states=580.0 / mono=20.2% / %srd=5454]        (0.07s)

Finalization
  * BCF writing         [done]          [ZLIB / N=10 / L=21904] (0.07s)
  * Indexing output     [done]                                          (0.01s)

  * Output file         : [imputed.bcf]
  * Output format       : [BCF format | ZLIB compression | with CSI index]

Total running time [] (0s)

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
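
In the sample session the output region 20:1200000-3800000 sits inside the buffer region 20:1000000-4000000, which corresponds to 200 kb of padding on each side. A minimal sketch of deriving one from the other follows. buffer_region is a hypothetical helper, not part of IMPUTE5, and the 200 kb flank is simply the value implied by the example above, not a recommended setting.

```shell
#!/bin/bash
# Hypothetical helper (not part of IMPUTE5): pad an imputation region
# CHR:START-END by FLANK base pairs on each side, giving a value that
# could be passed to --buffer-region.
buffer_region() {
    local region=$1 flank=$2
    local chr=${region%%:*}      # text before the colon
    local range=${region#*:}     # START-END
    local start=${range%-*}
    local end=${range#*-}
    local bstart=$(( start - flank ))
    (( bstart < 1 )) && bstart=1 # clamp at the first base of the chromosome
    echo "${chr}:${bstart}-$(( end + flank ))"
}

# Reproduces the buffer region used in the sample session.
buffer_region 20:1200000-3800000 200000
```

The helper does not know the chromosome length, so the upper bound is not clamped; that is an accepted simplification in this sketch.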

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. impute.sh). For example:

#!/bin/bash

module load impute5

xcftools view -i reference.bcf -o reference_xcf.bcf -O sh -r 20 -T 8 -m 0.03125

impute5 \
  --h reference_xcf.bcf \
  --g target.bcf \
  --o imputed.bcf \
  --r 20:1200000-3800000 \
  --m chr20.b37.gmap.gz \
  --buffer-region 20:1000000-4000000

Submit this job using the Slurm sbatch command.

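sbatch --cpus-per-task=8 [--mem=#] impute.sh

The --cpus-per-task value should match the number of threads requested in the script (-T 8 for xcftools in the example above).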
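
To impute a longer stretch of a chromosome, one common approach is to split it into windows and run one impute5 command per window. The sketch below prints such commands using only the impute5 options shown on this page. make_chunk_commands is a hypothetical helper, and the window size, flank size, chromosome extent, and output file names are illustrative assumptions rather than recommended settings.

```shell
#!/bin/bash
# Sketch: print one impute5 command per fixed-size window of a chromosome.
# Arguments: CHR FIRST LAST WINDOW FLANK (positions and sizes in base pairs).
# File names mirror the examples above; the output naming is an assumption.
make_chunk_commands() {
    local chr=$1 first=$2 last=$3 window=$4 flank=$5
    local start end bstart bend
    for (( start = first; start <= last; start += window )); do
        end=$(( start + window - 1 ))
        (( end > last )) && end=$last        # last window may be shorter
        bstart=$(( start - flank ))
        (( bstart < 1 )) && bstart=1         # clamp at the first base
        bend=$(( end + flank ))
        echo "impute5 --h reference_xcf.bcf --g target.bcf" \
             "--m chr${chr}.b37.gmap.gz" \
             "--r ${chr}:${start}-${end}" \
             "--buffer-region ${chr}:${bstart}-${bend}" \
             "--o imputed.chr${chr}.${start}-${end}.bcf"
    done
}

# Example: the first 10 Mb of chromosome 20 in 5 Mb windows with 250 kb flanks.
make_chunk_commands 20 1 10000000 5000000 250000
```

Redirecting the output to a file (for example impute5.swarm) gives one command per line, which could then be submitted with the Biowulf swarm utility, e.g. swarm -f impute5.swarm [-g #] --module impute5.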