Boltzgen on Biowulf

BoltzGen is an all-atom generative model for designing proteins and peptides across all modalities to bind a wide range of biomolecular targets. It builds structural reasoning about target-binder interactions into the generative design process by unifying design and structure prediction, which yields a single model that also reaches state-of-the-art folding performance.

Documentation
Important Notes

This application is designed to run on GPUs.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session on a GPU node and run the program.
Sample session (user input in bold):

[user@biowulf]$ sinteractive \
  --gres=gpu:a100:1,lscratch:20 \
  --mem=64G \
  --cpus-per-task=8
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load boltzgen
[+] Loading boltzgen  0.2.0  on cn3144
[+] Loading singularity  4.2.2  on cn3144
[user@cn3144 ~]$ boltzgen run -h
usage: boltzgen run [-h]
                    [--protocol {protein-anything,peptide-anything,protein-small_molecule,nanobody-anything,antibody-anything}]
                    [--output OUTPUT] [--config CONFIG [CONFIG ...]] [--devices DEVICES]
                    [--num_workers NUM_WORKERS] [--config_dir CONFIG_DIR]
                    [--use_kernels {auto,true,false}] [--moldir MOLDIR] [--reuse]
                    [--num_designs NUM_DESIGNS] [--diffusion_batch_size DIFFUSION_BATCH_SIZE]
                    [--design_checkpoints DESIGN_CHECKPOINTS [DESIGN_CHECKPOINTS ...]]
                    [--step_scale STEP_SCALE] [--noise_scale NOISE_SCALE] [--skip_inverse_folding]
                    [--inverse_fold_num_sequences INVERSE_FOLD_NUM_SEQUENCES]
                    [--inverse_fold_checkpoint INVERSE_FOLD_CHECKPOINT]
                    [--inverse_fold_avoid INVERSE_FOLD_AVOID] [--only_inverse_fold]
                    [--folding_checkpoint FOLDING_CHECKPOINT] [--affinity_checkpoint AFFINITY_CHECKPOINT]
                    [--budget BUDGET] [--alpha ALPHA] [--filter_biased {true,false}]
                    [--metrics_override METRICS_OVERRIDE [METRICS_OVERRIDE ...]]
                    [--additional_filters ADDITIONAL_FILTERS [ADDITIONAL_FILTERS ...]]
                    [--size_buckets SIZE_BUCKETS [SIZE_BUCKETS ...]]
                    [--refolding_rmsd_threshold REFOLDING_RMSD_THRESHOLD] [--no_subprocess]
                    [--steps {design,inverse_folding,design_folding,folding,affinity,analysis,filtering} [{design,inverse_folding,design_folding,folding,affinity,analysis,filtering} ...]]
                    [--force_download] [--models_token MODELS_TOKEN] [--cache CACHE]
                    design_spec [design_spec ...]

Boltzgen binder design pipeline

options:
  -h, --help            show this help message and exit

design specification:
  design_spec           Path(s) to design specification YAML file(s), or a directory containing prepared
                        configs

general configuration:
  --protocol {protein-anything,peptide-anything,protein-small_molecule,nanobody-anything,antibody-anything}
                        Protocol to use for the design. This determines default settings and in some cases
                        what steps are run. Default: protein-anything
  --output OUTPUT       Output directory for pipeline results
  --config CONFIG [CONFIG ...]
                        Override pipeline step configuration, in format  =
                        = ...(example: '--config folding num_workers=4 trainer.devices=4').
                        Can be used multiple times.
  --devices DEVICES     Number of devices to use. Default is all devices available.
  --num_workers NUM_WORKERS
                        Number of DataLoader worker processes.
  --config_dir CONFIG_DIR
                        Path to the directory of default config files. Default:
                        /opt/conda/lib/python3.12/site-packages/boltzgen/resources/config
  --use_kernels {auto,true,false}
                        Whether to use kernels. One of 'auto', 'true', or 'false'. Default: auto. If
                        'auto', will use kernels if the device capability is >= 8.
  --moldir MOLDIR       Path to the moldir. Default: huggingface:boltzgen/inference-data:mols.zip
  --reuse               Reuse existing results across all steps. Generate only as many new designs are
                        needed to achieve the specified total number of designs.

design:
  --num_designs NUM_DESIGNS
                        Number of total designs to generate. This commonly would be something like
                        10,000After generating 10,000 designs we then filter down to --budget many designs
                        in the filter step
  --diffusion_batch_size DIFFUSION_BATCH_SIZE
                        Number of diffusion samples to generate per trunk run. If not specified, this
                        defaults to 1 if --num-designs is less than 100, and 10 otherwise. Note that for
                        design tasks that randomly sample the binder length (or use randomness in other
                        ways), all designs generated in the same batch will share the same length. Having
                        a large diffusion batch size compared to the total number of designs to generate
                        will therefore not evenly sample the possible lengths.
  --design_checkpoints DESIGN_CHECKPOINTS [DESIGN_CHECKPOINTS ...]
                        Path to the boltzgen checkpoint(s). One or more checkpoints are supported. Just
                        specifying an individual path here will work.Each will be used for an equal
                        fraction of the designs. By default, two checkpoints are used. Default:
                        ['huggingface:boltzgen/boltzgen-1:boltzgen1_diverse.ckpt',
                        'huggingface:boltzgen/boltzgen-1:boltzgen1_adherence.ckpt']
  --step_scale STEP_SCALE
                        Fixed step scale to use (e.g. 1.8). Default is to use a schedule
  --noise_scale NOISE_SCALE
                        Fixed noise scale to use (e.g. 0.98). Default is to use a schedule

inverse folding:
  --skip_inverse_folding
                        Skip inverse folding step
  --inverse_fold_num_sequences INVERSE_FOLD_NUM_SEQUENCES
                        Number of sequences per backbone to generate in the inverse fold step. Default: 1
  --inverse_fold_checkpoint INVERSE_FOLD_CHECKPOINT
                        Path or huggingface repo and filename for the inverse fold checkpoint. Default:
                        huggingface:boltzgen/boltzgen-1:boltzgen1_ifold.ckpt
  --inverse_fold_avoid INVERSE_FOLD_AVOID
                        Disallowed residues as a string of one letter amino acid codes, e.g. 'KEC'. This
                        is implemented at the inverse fold step, so it only affects results if inverse
                        folding is enabled. Default: none for protein design, 'C' for peptide and
                        antibody/nanobody design. Pass an empty list if you want Cysteins to be generated
                        if you are using antibody/nanobody/peptide protocol
  --only_inverse_fold   Skip design step and only run inverse folding. Requires a fully specified
                        structure.

folding and affinity prediction:
  --folding_checkpoint FOLDING_CHECKPOINT
                        Path to the folding checkpoint. Default:
                        huggingface:boltzgen/boltzgen-1:boltz2_conf_final.ckpt
  --affinity_checkpoint AFFINITY_CHECKPOINT
                        Path to the affinity predictor checkpoint. Default:
                        huggingface:boltzgen/boltzgen-1:boltz2_aff.ckpt

filtering:
  --budget BUDGET       How many designs should be in the final diversity optimized set. This is used in
                        the filtering step.
  --alpha ALPHA         Trade-off for sequence diversity selection: 0.0=quality-only, 1.0=diversity-only.
                        Default is 0.01 (peptide-anything protocol) or 0.001 (other protocols).
  --filter_biased {true,false}
                        Remove amino-acid composition outliers (default caps on ALA/GLY/GLU/LEU/VAL).
                        Default: true.
  --metrics_override METRICS_OVERRIDE [METRICS_OVERRIDE ...]
                        Per-metric inverse-importance weights for ranking. Format: metric_name=weight
                        (e.g., plip_hbonds_refolded=4 delta_sasa_refolded=2). A larger value down-weights
                        that metric's rank. Use 'metric_name=none' to remove a metric.
  --additional_filters ADDITIONAL_FILTERS [ADDITIONAL_FILTERS ...]
                        Extra hard filters. Format: feature>threshold or feature<threshold (e.g.,
                        '…>0.3' 'design_GLY<0.2'). Use '>' if higher is better, '<' if lower is
                        better. Make sure to single-quote the strings so your shell doesn't get confused
                        by < and > characters.
  --size_buckets SIZE_BUCKETS [SIZE_BUCKETS ...]
                        Optional constraint for maximum number of designs in size ranges. Format: min-
                        max:count (e.g., 10-20:5 20-30:10 30-40:5).
  --refolding_rmsd_threshold REFOLDING_RMSD_THRESHOLD
                        Threshold used for RMSD-based filters (lower is better).

execution options:
  --no_subprocess       Run each step in the main process. Will cause issues when devices >1.
  --steps {design,inverse_folding,design_folding,folding,affinity,analysis,filtering} [{design,inverse_folding,design_folding,folding,affinity,analysis,filtering} ...]
                        Run only the specified pipeline steps (default: run all steps)

model and data download options:
  --force_download      Force a (re)-download of models and data.
  --models_token MODELS_TOKEN
                        Secret token to use for our models hosting service (Hugging Face). Default: None
  --cache CACHE         Directory where downloaded models will be stored. Default: ~/.cache

This script orchestrates work. It sets up an output directory with yaml files of pipeline steps that need to be run, and launches processes that run the pipeline steps.

Mainly it:
1) **Writes to yaml files** when `configure_command(...)` is executed
   - For each `PipelineStep`, the resolved Hydra config is written to
     `OUTPUT/config/.yaml`.
   - A manifest `OUTPUT/steps.yaml` is also written, listing the enabled steps
     and their config files in execution order.

2) **Executes from YAML** when `execute_command(...)` is executed
   - Each step is launched **as a subprocess** (`python main.py `)
     unless `--no_subprocess` is set (not the default).
   - If `--no_subprocess` is specified, the config is instantiated in-process
     and the `Task.run(...)` method is called directly.

The actual code that is exectued in each pipeline step is found in `main.py` which a wrapper for running the .run() function of our `Task` class.
If you run the pipeline (for example via `boltzgen run design_spec.yaml ...`) then this function reads the yaml files of the individual pipeline steps and executes the pipeline steps.

The possible tasks (and code files you want to inspect to understand what they are running):
    - Predict src/boltzgen/task/predict/predict.py (GPU: Running BoltzGen diffusion, inverse folding, refolding, designfolding, or affinity prediction)
    - Analyze src/boltzgen/task/analyze/analyze.py (CPU: Compute CPU Metrics and aggregate metrics from GPU steps)
    - Filter src/boltzgen/task/filter/filter.py (CPU: Very fast (20s) computes ranking and writes final output files)
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
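
The help text above asks for filter expressions to be single-quoted. The following is a minimal sketch of why, using only flags listed in the help output. The filter `design_GLY<0.2` is the help text's own example, and the design spec and output paths are borrowed from the example session on this page. The command is only printed, not executed, so the sketch can be tried without a GPU allocation.

```shell
#!/bin/bash
# Sketch only: builds the argument list and prints it instead of running boltzgen.
# Single quotes stop the shell from treating '<' as input redirection; unquoted,
# the shell would pass just "design_GLY" and try to read stdin from a file named 0.2.
set -- boltzgen run example/vanilla_protein/1g13prot.yaml \
    --output workbench/test_run \
    --additional_filters 'design_GLY<0.2'

# Print one argument per line to confirm the filter expression arrived intact.
printf '%s\n' "$@"
```

The same applies to filters using `>`, which an unquoted shell would interpret as output redirection and use to create or truncate a file.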

Example Interactive Session Command:

[user@biowulf]$ sinteractive \
  --gres=gpu:a100:1,lscratch:20 \
  --mem=64G \
  --cpus-per-task=8 
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load boltzgen
[+] Loading boltzgen  0.2.0  on cn3144
[+] Loading singularity  4.2.2  on cn3144
[user@cn3144 ~]$ cd /data/$USER 
[user@cn3144 ~]$ cp -a $EXAMPLES . 
[user@cn3144 ~]$ boltzgen run example/vanilla_protein/1g13prot.yaml \
  --output workbench/test_run \
  --protocol protein-anything \
  --num_designs 10 \
  --budget 2 \
  --cache /lscratch/$SLURM_JOB_ID 
  
=== Configuring pipeline ===
mols.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391M/391M [00:02<00:00, 152MB/s]
Using dataset artifact: /lscratch/7082051/datasets--boltzgen--inference-data/snapshots/c3d36fd276e9caf098c75d4113c6d5eb320b1a4c/mols.zip
Creating output directory: workbench/test_run
************** Checking design spec: example/vanilla_protein/1g13prot.yaml **************
Total designed residues: 95
Design specification visualization is written to workbench/test_run/1g13prot.cif
*****************************************************************************************
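
The test run above generates only 10 designs. As a hedged sketch of two follow-ups that the help text describes, the first command below grows the same output directory to a larger total with `--reuse`, and the second re-runs only the filtering step with the same `--budget`. The numbers (100 designs, a budget of 5) are illustrative assumptions, and `run_or_print` is a hypothetical helper that prints each command unless `DRY_RUN=0` is set. Whether the filtering step needs further options when run on its own has not been verified here.

```shell
#!/bin/bash
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run_or_print() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

SPEC=example/vanilla_protein/1g13prot.yaml
OUT=workbench/test_run

# Top up the existing run: per the help text, --reuse generates only as many
# new designs as are needed to reach the requested total.
run_or_print boltzgen run "$SPEC" --output "$OUT" --protocol protein-anything \
    --num_designs 100 --budget 5 --reuse --cache "/lscratch/$SLURM_JOB_ID"

# Re-rank existing results: --steps restricts the run to the listed steps.
run_or_print boltzgen run "$SPEC" --output "$OUT" --protocol protein-anything \
    --steps filtering --budget 5
```

Run as shown, the script only prints the two commands. Setting `DRY_RUN=0` inside a GPU allocation would execute them.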

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. test.sh). For example:

#!/bin/bash
module load boltzgen
cd /data/$USER

export CUDA_VISIBLE_DEVICES=0

boltzgen run example/vanilla_protein/1g13prot.yaml \
  --output workbench/test_run \
  --protocol protein-anything \
  --num_designs 10 \
  --budget 2 \
  --cache /lscratch/$SLURM_JOB_ID

Submit this job using the Slurm sbatch command.

sbatch --partition=gpu --cpus-per-task=8 --mem=64G --gres=gpu:a100:1,lscratch:20 test.sh
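
Both examples point `--cache` at `/lscratch/$SLURM_JOB_ID`, which is only available when local scratch is requested (`lscratch:20` in the `--gres` option above). The following is a small sketch of a guard that could be added to the batch script. `pick_cache_dir` is a hypothetical helper, and the fallback to `~/.cache` simply mirrors the default reported by `boltzgen run -h`.

```shell
#!/bin/bash
# Hypothetical helper: choose where downloaded models are cached.
# Prefers the job's local scratch directory and falls back to ~/.cache,
# which is the default shown in the boltzgen help output.
pick_cache_dir() {
    if [ -n "${SLURM_JOB_ID:-}" ] && [ -d "/lscratch/${SLURM_JOB_ID}" ] \
        && [ -w "/lscratch/${SLURM_JOB_ID}" ]; then
        echo "/lscratch/${SLURM_JOB_ID}"
    else
        echo "warning: no writable lscratch directory found, using ~/.cache" >&2
        echo "${HOME}/.cache"
    fi
}

CACHE_DIR="$(pick_cache_dir)"
echo "Using cache directory: ${CACHE_DIR}"
# The result would then be passed on as:  boltzgen run ... --cache "$CACHE_DIR"
```

Note that the fallback downloads the models into the home directory, so it may be preferable to stop the job instead if home directory space is limited.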