Ollama is a command-line tool for running large language models (LLMs) locally.
Ollama is in an early user-testing phase: not all functionality is guaranteed to work. Contact staff@hpc.nih.gov with any questions.
Hardware requirements
Quantization considerations: 4-bit quantization reduces memory use to roughly 25% of the FP16 original, so it is highly recommended.

Model Size | VRAM (FP16) | VRAM (4-bit) | GPU type |
---|---|---|---|
1–3B | 4-6GB | ~2GB | K80,P100,V100,V100x,A100 |
7–8B | 14-16GB | ~6-8GB | P100,V100,V100x,A100 |
13-14B | 26-28GB | ~12-16GB | V100x,A100 |
70B+ | 140GB+ | ~35-40GB | A100(4-bit) |
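Most default tags in the Ollama model library are already 4-bit quantized builds (typically Q4_K_M), and many models also publish explicit quantization tags. The tags below are illustrative only; check https://ollama.com/library for what actually exists:

# default tag: typically already a 4-bit (Q4_K_M) build
ollama pull gemma3:1b

# explicitly quantized tag (illustrative; verify on https://ollama.com/library)
ollama pull llama3.1:8b-instruct-q4_K_M

# full-precision variants need far more VRAM (see table above)
ollama pull llama3.1:8b-instruct-fp16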
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive --gres=gpu:1,lscratch:10 --constraint="gpuv100|gpuv100x|gpua100" -c 8 --mem=10g --tunnel
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load ollama
[user@cn3144 ~]$ cd /data/$USER/
[user@cn3144 ~]$ ollama_start
Running ollama on localhost:xxxxx
######################################
export OLLAMA_HOST=localhost:xxxxx
######################################
[user@cn3144 ~]$ export OLLAMA_HOST=localhost:xxxxx    # or "source $SLURM_JOB_ID/ollama.sh"
[user@cn3144 ~]$ ollama list
[user@cn3144 ~]$ ollama pull gemma3:1b
[user@cn3144 ~]$ ollama run gemma3:1b
>>> what is long read sequencing    ### enter prompts interactively; /bye to exit

### run gemma3:1b with a single prompt and write the response to response.txt
[user@cn3144 ~]$ ollama run gemma3:1b "what is long read sequencing" > response.txt
[user@cn3144 ~]$ ollama_stop
Terminated
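The server that ollama_start launches on $OLLAMA_HOST also speaks Ollama's standard REST API. As a sketch (assuming the wrapper runs an unmodified ollama server), you can query it directly with curl from the same session:

# POST a generate request to the server started above; the port is the
# one printed by ollama_start ("stream": false returns a single JSON reply)
curl http://$OLLAMA_HOST/api/generate -d '{
  "model": "gemma3:1b",
  "prompt": "what is long read sequencing",
  "stream": false
}'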
Create a batch input file (e.g. ollama_job.sh). For example:
#!/bin/bash
set -e

module load ollama
cd /data/$USER
ollama_start
sleep 2
source $SLURM_JOB_ID/ollama.sh
ollama run gemma3:1b "what is long read sequencing" > response.txt
ollama_stop
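To run several prompts in one batch job, a loop can replace the single ollama run line. This is a minimal sketch; prompts.txt (one prompt per line) is a hypothetical input file:

#!/bin/bash
set -e

module load ollama
cd /data/$USER
ollama_start
sleep 2
source $SLURM_JOB_ID/ollama.sh

# hypothetical prompts.txt: one prompt per line, one response file per prompt
n=0
while IFS= read -r prompt; do
    n=$((n+1))
    ollama run gemma3:1b "$prompt" > "response_${n}.txt"
done < prompts.txt

ollama_stop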
Submit this job using the Slurm sbatch command.
sbatch --partition=gpu --gres=gpu:1,lscratch:10 --constraint="gpuv100|gpuv100x|gpua100" -c 8 --mem=10g ollama_job.sh
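You can monitor the queued job with standard Slurm commands and, once it completes, read the model's response from the output file written by the script above:

[user@biowulf]$ squeue -u $USER
[user@biowulf]$ cat /data/$USER/response.txt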