BoltzGen is an all-atom generative model for designing proteins and peptides across all modalities to bind a wide range of biomolecular targets. It builds strong structural reasoning about target-binder interactions into its generative design process by unifying design and structure prediction, resulting in a single model that also reaches state-of-the-art folding performance.
This application is suited to run on GPUs.
Allocate an interactive session on the GPU partition and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive \
--gres=gpu:a100:1,lscratch:20 \
--mem=64G \
--cpus-per-task=8
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load boltzgen
[+] Loading boltzgen 0.2.0 on cn3144
[+] Loading singularity 4.2.2 on cn3144
[user@cn3144 ~]$ boltzgen run -h
usage: boltzgen run [-h]
[--protocol {protein-anything,peptide-anything,protein-small_molecule,nanobody-anything,antibody-anything}]
[--output OUTPUT] [--config CONFIG [CONFIG ...]] [--devices DEVICES]
[--num_workers NUM_WORKERS] [--config_dir CONFIG_DIR]
[--use_kernels {auto,true,false}] [--moldir MOLDIR] [--reuse]
[--num_designs NUM_DESIGNS] [--diffusion_batch_size DIFFUSION_BATCH_SIZE]
[--design_checkpoints DESIGN_CHECKPOINTS [DESIGN_CHECKPOINTS ...]]
[--step_scale STEP_SCALE] [--noise_scale NOISE_SCALE] [--skip_inverse_folding]
[--inverse_fold_num_sequences INVERSE_FOLD_NUM_SEQUENCES]
[--inverse_fold_checkpoint INVERSE_FOLD_CHECKPOINT]
[--inverse_fold_avoid INVERSE_FOLD_AVOID] [--only_inverse_fold]
[--folding_checkpoint FOLDING_CHECKPOINT] [--affinity_checkpoint AFFINITY_CHECKPOINT]
[--budget BUDGET] [--alpha ALPHA] [--filter_biased {true,false}]
[--metrics_override METRICS_OVERRIDE [METRICS_OVERRIDE ...]]
[--additional_filters ADDITIONAL_FILTERS [ADDITIONAL_FILTERS ...]]
[--size_buckets SIZE_BUCKETS [SIZE_BUCKETS ...]]
[--refolding_rmsd_threshold REFOLDING_RMSD_THRESHOLD] [--no_subprocess]
[--steps {design,inverse_folding,design_folding,folding,affinity,analysis,filtering} [{design,inverse_folding,design_folding,folding,affinity,analysis,filtering} ...]]
[--force_download] [--models_token MODELS_TOKEN] [--cache CACHE]
design_spec [design_spec ...]
Boltzgen binder design pipeline
options:
-h, --help show this help message and exit
design specification:
design_spec Path(s) to design specification YAML file(s), or a directory containing prepared
configs
general configuration:
--protocol {protein-anything,peptide-anything,protein-small_molecule,nanobody-anything,antibody-anything}
Protocol to use for the design. This determines default settings and in some cases
what steps are run. Default: protein-anything
--output OUTPUT Output directory for pipeline results
--config CONFIG [CONFIG ...]
Override pipeline step configuration, in format =
= ...(example: '--config folding num_workers=4 trainer.devices=4').
Can be used multiple times.
--devices DEVICES Number of devices to use. Default is all devices available.
--num_workers NUM_WORKERS
Number of DataLoader worker processes.
--config_dir CONFIG_DIR
Path to the directory of default config files. Default:
/opt/conda/lib/python3.12/site-packages/boltzgen/resources/config
--use_kernels {auto,true,false}
Whether to use kernels. One of 'auto', 'true', or 'false'. Default: auto. If
'auto', will use kernels if the device capability is >= 8.
--moldir MOLDIR Path to the moldir. Default: huggingface:boltzgen/inference-data:mols.zip
--reuse Reuse existing results across all steps. Generate only as many new designs are
needed to achieve the specified total number of designs.
design:
--num_designs NUM_DESIGNS
Number of total designs to generate. This commonly would be something like
10,000. After generating 10,000 designs we then filter down to --budget many designs
in the filter step
--diffusion_batch_size DIFFUSION_BATCH_SIZE
Number of diffusion samples to generate per trunk run. If not specified, this
defaults to 1 if --num-designs is less than 100, and 10 otherwise. Note that for
design tasks that randomly sample the binder length (or use randomness in other
ways), all designs generated in the same batch will share the same length. Having
a large diffusion batch size compared to the total number of designs to generate
will therefore not evenly sample the possible lengths.
--design_checkpoints DESIGN_CHECKPOINTS [DESIGN_CHECKPOINTS ...]
Path to the boltzgen checkpoint(s). One or more checkpoints are supported. Just
specifying an individual path here will work. Each will be used for an equal
fraction of the designs. By default, two checkpoints are used. Default:
['huggingface:boltzgen/boltzgen-1:boltzgen1_diverse.ckpt',
'huggingface:boltzgen/boltzgen-1:boltzgen1_adherence.ckpt']
--step_scale STEP_SCALE
Fixed step scale to use (e.g. 1.8). Default is to use a schedule
--noise_scale NOISE_SCALE
Fixed noise scale to use (e.g. 0.98). Default is to use a schedule
inverse folding:
--skip_inverse_folding
Skip inverse folding step
--inverse_fold_num_sequences INVERSE_FOLD_NUM_SEQUENCES
Number of sequences per backbone to generate in the inverse fold step. Default: 1
--inverse_fold_checkpoint INVERSE_FOLD_CHECKPOINT
Path or huggingface repo and filename for the inverse fold checkpoint. Default:
huggingface:boltzgen/boltzgen-1:boltzgen1_ifold.ckpt
--inverse_fold_avoid INVERSE_FOLD_AVOID
Disallowed residues as a string of one letter amino acid codes, e.g. 'KEC'. This
is implemented at the inverse fold step, so it only affects results if inverse
folding is enabled. Default: none for protein design, 'C' for peptide and
antibody/nanobody design. Pass an empty list if you want Cysteins to be generated
if you are using antibody/nanobody/peptide protocol
--only_inverse_fold Skip design step and only run inverse folding. Requires a fully specified
structure.
folding and affinity prediction:
--folding_checkpoint FOLDING_CHECKPOINT
Path to the folding checkpoint. Default:
huggingface:boltzgen/boltzgen-1:boltz2_conf_final.ckpt
--affinity_checkpoint AFFINITY_CHECKPOINT
Path to the affinity predictor checkpoint. Default:
huggingface:boltzgen/boltzgen-1:boltz2_aff.ckpt
filtering:
--budget BUDGET How many designs should be in the final diversity optimized set. This is used in
the filtering step.
--alpha ALPHA Trade-off for sequence diversity selection: 0.0=quality-only, 1.0=diversity-only.
Default is 0.01 (peptide-anything protocol) or 0.001 (other protocols).
--filter_biased {true,false}
Remove amino-acid composition outliers (default caps on ALA/GLY/GLU/LEU/VAL).
Default: true.
--metrics_override METRICS_OVERRIDE [METRICS_OVERRIDE ...]
Per-metric inverse-importance weights for ranking. Format: metric_name=weight
(e.g., plip_hbonds_refolded=4 delta_sasa_refolded=2). A larger value down-weights
that metric's rank. Use 'metric_name=none' to remove a metric.
--additional_filters ADDITIONAL_FILTERS [ADDITIONAL_FILTERS ...]
Extra hard filters. Format: feature>threshold or feature<threshold (e.g., 'design_GLY<0.2'). Use '>' if higher is better, '<' if lower is
better. Make sure to single-quote the strings so your shell doesn't get confused
by < and > characters.
--size_buckets SIZE_BUCKETS [SIZE_BUCKETS ...]
Optional constraint for maximum number of designs in size ranges. Format: min-
max:count (e.g., 10-20:5 20-30:10 30-40:5).
--refolding_rmsd_threshold REFOLDING_RMSD_THRESHOLD
Threshold used for RMSD-based filters (lower is better).
execution options:
--no_subprocess Run each step in the main process. Will cause issues when devices >1.
--steps {design,inverse_folding,design_folding,folding,affinity,analysis,filtering} [{design,inverse_folding,design_folding,folding,affinity,analysis,filtering} ...]
Run only the specified pipeline steps (default: run all steps)
model and data download options:
--force_download Force a (re)-download of models and data.
--models_token MODELS_TOKEN
Secret token to use for our models hosting service (Hugging Face). Default: None
--cache CACHE Directory where downloaded models will be stored. Default: ~/.cache
This script orchestrates work. It sets up an output directory with yaml files of pipeline steps that need to be run, and launches processes that run the pipeline steps.
Mainly it:
1) **Writes to yaml files** when `configure_command(...)` is executed
- For each `PipelineStep`, the resolved Hydra config is written to
`OUTPUT/config/.yaml`.
- A manifest `OUTPUT/steps.yaml` is also written, listing the enabled steps
and their config files in execution order.
2) **Executes from YAML** when `execute_command(...)` is executed
- Each step is launched **as a subprocess** (`python main.py `)
unless `--no_subprocess` is set (not the default).
- If `--no_subprocess` is specified, the config is instantiated in-process
and the `Task.run(...)` method is called directly.
The actual code that is executed in each pipeline step is found in `main.py`, which is a wrapper for running the .run() function of our `Task` class.
If you run the pipeline (for example via `boltzgen run design_spec.yaml ...`) then this function reads the yaml files of the individual pipeline steps and executes the pipeline steps.
The possible tasks (and code files you want to inspect to understand what they are running):
- Predict src/boltzgen/task/predict/predict.py (GPU: Running BoltzGen diffusion, inverse folding, refolding, designfolding, or affinity prediction)
- Analyze src/boltzgen/task/analyze/analyze.py (CPU: Compute CPU Metrics and aggregate metrics from GPU steps)
- Filter src/boltzgen/task/filter/filter.py (CPU: Very fast (20s) computes ranking and writes final output files)
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
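The pipeline description at the end of the help output says that configuring a run writes a manifest `OUTPUT/steps.yaml` and per-step configs under `OUTPUT/config/`. The following is a minimal sketch for inspecting those files before or after a run. The function name `show_pipeline_plan` is hypothetical, and the sketch assumes only the two paths named in the help text; the contents and names of the per-step files are not assumed.

```shell
#!/bin/bash
# Hypothetical helper: print the step manifest and list the per-step
# config files that boltzgen wrote into an output directory.
show_pipeline_plan() {
    local out_dir="$1"

    # steps.yaml is the manifest named in the help text.
    if [ ! -f "$out_dir/steps.yaml" ]; then
        echo "no steps.yaml in $out_dir (has the pipeline been configured?)" >&2
        return 1
    fi

    echo "== $out_dir/steps.yaml =="
    cat "$out_dir/steps.yaml"

    echo "== per-step configs in $out_dir/config =="
    ls "$out_dir/config"
}

# Usage (after a run that used --output workbench/test_run):
#   show_pipeline_plan workbench/test_run
```

Reading the manifest first is a cheap way to confirm which steps are enabled, and in what order, before committing GPU time.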
Example Interactive Session Command:
[user@biowulf]$ sinteractive \
    --gres=gpu:a100:1,lscratch:20 \
    --mem=64G \
    --cpus-per-task=8
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load boltzgen
[+] Loading boltzgen 0.2.0 on cn3144
[+] Loading singularity 4.2.2 on cn3144
[user@cn3144 ~]$ cd /data/$USER
[user@cn3144 ~]$ cp -a $EXAMPLES .
[user@cn3144 ~]$ boltzgen run example/vanilla_protein/1g13prot.yaml \
    --output workbench/test_run \
    --protocol protein-anything \
    --num_designs 10 \
    --budget 2 \
    --cache /lscratch/$SLURM_JOB_ID
=== Configuring pipeline ===
mols.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391M/391M [00:02<00:00, 152MB/s]
Using dataset artifact: /lscratch/7082051/datasets--boltzgen--inference-data/snapshots/c3d36fd276e9caf098c75d4113c6d5eb320b1a4c/mols.zip
Creating output directory: workbench/test_run
************** Checking design spec: example/vanilla_protein/1g13prot.yaml **************
Total designed residues: 95
Design specification visualization is written to workbench/test_run/1g13prot.cif
*****************************************************************************************
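The run above uses `--num_designs 10` without setting `--diffusion_batch_size`. According to the help text, the batch size then defaults to 1 when fewer than 100 designs are requested and to 10 otherwise, and all designs in one batch share the same sampled binder length. The sketch below restates that rule in shell so the effect of a given `--num_designs` can be checked quickly. Both function names are hypothetical, and the trunk-run count (designs divided by batch size, rounded up) is an assumption drawn from the wording "samples per trunk run", not something the help output states.

```shell
#!/bin/bash
# Default --diffusion_batch_size as described in the help text:
# 1 when --num_designs is below 100, otherwise 10.
default_diffusion_batch_size() {
    local num_designs="$1"
    if [ "$num_designs" -lt 100 ]; then
        echo 1
    else
        echo 10
    fi
}

# Assumption: number of trunk runs = ceil(num_designs / batch_size).
trunk_runs() {
    local num_designs="$1" batch_size="$2"
    echo $(( (num_designs + batch_size - 1) / batch_size ))
}

for n in 10 100 10000; do
    b=$(default_diffusion_batch_size "$n")
    echo "num_designs=$n batch_size=$b trunk_runs=$(trunk_runs "$n" "$b")"
done
# Prints:
#   num_designs=10 batch_size=1 trunk_runs=10
#   num_designs=100 batch_size=10 trunk_runs=10
#   num_designs=10000 batch_size=10 trunk_runs=1000
```

If the design spec samples the binder length at random, a smaller explicit `--diffusion_batch_size` spreads designs over more lengths at the cost of more trunk runs.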
Create a batch input file (e.g. test.sh). For example:
#!/bin/bash
module load boltzgen
cd /data/$USER
export CUDA_VISIBLE_DEVICES=0
boltzgen run example/vanilla_protein/1g13prot.yaml \
    --output workbench/test_run \
    --protocol protein-anything \
    --num_designs 10 \
    --budget 2 \
    --cache /lscratch/$SLURM_JOB_ID
Submit this job using the Slurm sbatch command.
sbatch --partition=gpu --cpus-per-task=8 --gres=gpu:a100:1,lscratch:20 test.sh
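The help text lists `--steps` for running only selected pipeline steps and describes the Filter task as a fast CPU step. This suggests that the final selection can be re-ranked without repeating the GPU work. The sketch below builds such a command for the output directory used above. It assumes the earlier run completed, so that the results the filtering step reads are already present in `workbench/test_run`. The `--budget` and `--alpha` values are illustrative only, and the `DRY_RUN` switch is a convention of this sketch, not a boltzgen feature.

```shell
#!/bin/bash
# Sketch: re-run only the filtering step on an existing output directory.
out_dir="workbench/test_run"
spec="example/vanilla_protein/1g13prot.yaml"

# All flags below appear in the `boltzgen run -h` output; values are examples.
cmd=(boltzgen run "$spec"
     --output "$out_dir"
     --protocol protein-anything
     --steps filtering
     --budget 5
     --alpha 0.05)

# By default only print the command for review; set DRY_RUN=0 to execute it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' "${cmd[@]}"
    echo
else
    "${cmd[@]}"
fi
```

Keeping the command in an array preserves quoting of paths with spaces and makes it easy to review exactly what would be submitted.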