To develop software, you must understand the problem as well as the desired solution. If you don’t understand the solution, you’re doing research. If you don’t understand the problem, you’re just mucking about with computers. Joe Leonard |
Hello and thank you for your interest in Biowulf, the High-Performance Computing (HPC) cluster at the National Institutes of Health. Biowulf, with over 90,000 cores, 70 petabytes of storage, and an array of over 900 GPUs, is one of the world’s largest computing clusters devoted to biomedical research and is available to any researcher in the NIH IRP (Intramural Research Program) who is sponsored by their IRP Principal Investigator and pays the flat $40/month usage fee. It has a wide variety of software installed supporting sequence analysis, structural biology, computational chemistry, mathematical statistics, and image processing. Biowulf is supported by a team of nearly 20 systems administrators and support scientists who ensure the system’s stability and assist users with any questions they encounter in the course of their work. The goal of this document is to provide information about the use of Biowulf to researchers who already have experience with another HPC cluster of some form. Potential Biowulf users who don’t have any particular cluster experience may find the virtual orientations for Biowulf (see https://hpc.nih.gov/training/) a gentler introduction to cluster computing. New users with specific questions are also welcome to contact the HPC staff directly via email at staff@hpc.nih.gov.
HPC clusters generally come in two varieties. Some, like the Department of Energy clusters at Sandia National Lab or Oak Ridge National Lab, are built to run a single large program as fast as possible. These clusters are optimized for processor and network speed and are called Capability clusters. Others, like Biowulf, are designed to have as many CPU cores and as much memory as is feasible. Their overall goal is to provide as many compute cycles as possible to as many users as possible per unit of time. These are known as Capacity clusters. To use an automotive metaphor, a Capability cluster is like a Ferrari or Maserati (or maybe a VW microbus with a Porsche turbocharged engine) that can get a few people someplace as fast as possible. A Capacity cluster, on the other hand, is more like an 18-wheel tractor trailer. It can still do 75 MPH on the highway, but its raison d’être is to move a large quantity of freight from point A to point B efficiently. The practical upshot of Biowulf being a Capacity cluster is that it is shared between over 3,000 users, with over 500 of them actively running jobs on the compute nodes simultaneously.
Biowulf is a Linux-based, heterogeneous HPC cluster hosted by the NIH Center for Information Technology (CIT) that is available to researchers in the NIH IRP. It is composed of, among other things, a login (or head) node, biowulf.nih.gov, and thousands of compute nodes. Almost all the actual computing occurs on the compute nodes. The head node is primarily there as a gateway to the compute nodes, not as a compute resource itself. Access to the compute nodes is mediated by the SLURM resource manager/batch scheduler. Users cannot connect to compute nodes outside of the batch system unless the batch system has already placed one of their jobs on that particular node. Most batch jobs are allowed to run for up to 10 days by default. In our SLURM configuration, there are several partitions, or queues, configured, each with a different purpose.
One partition that deserves special mention is unlimited: jobs in this partition may run longer than the default 10-day limit, but they must be submitted with sbatch, and there is no guarantee that the cluster will stay up to let your job run to completion. Scheduled maintenance on Biowulf can still cause jobs in unlimited to terminate early. The batchlim command shows the full list of partitions and their limits.

On Biowulf, there are several filesystems available to users that fulfill different roles in the Biowulf ecosystem. It is important to note that PII/PHI data is NOT allowed anywhere on Biowulf without consulting with the HPC staff and making special arrangements.
It is also important to note that, for maximum performance, no single directory or subdirectory should contain more than 5000 files. If you need to work with many thousands of files, your jobs will obtain the best performance from a nested directory structure with no more than 5000 files and directories per subdirectory. Operating on a single directory containing millions of files is one of the best ways to unnecessarily increase your job’s run-time.
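As a sketch of the nested layout described above (the function name, the bucket naming scheme, and the paths are all illustrative, not a Biowulf utility), a small shell function can distribute a flat directory of files into numbered subdirectories of at most 5000 entries each:

```shell
# bucket_files: move the regular files in $1 into numbered
# subdirectories of $2, at most $3 files per subdirectory
# (defaulting to the 5000-entry guideline).
bucket_files() {
    local src=$1 dest=$2 per_dir=${3:-5000}
    local i=0 bucket=0 f
    mkdir -p "$dest/bucket_$bucket"
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        if [ "$i" -ge "$per_dir" ]; then
            # current bucket is full; start a new one
            bucket=$((bucket + 1))
            i=0
            mkdir -p "$dest/bucket_$bucket"
        fi
        mv "$f" "$dest/bucket_$bucket/"
        i=$((i + 1))
    done
}
```

For example, `bucket_files /data/$USER/flat /data/$USER/nested` would leave no subdirectory of `nested` holding more than 5000 files.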
The /home filesystem is intended to hold configuration files and short scripts for Biowulf users. It is not particularly fast to access, but it is designed to be highly available. It is strictly limited to 16 gigabytes per user. This amount cannot be increased. /home is backed up to tape and there are frequent snapshots taken on an hourly, daily, and weekly basis.
The /data filesystem, on the other hand, is designed to offer at least 100 gigabytes to each user, which can be increased upon providing a justification to the Biowulf staff. It is made up of high-performance network filesystems shared between the head node and the compute nodes, so it is not as fast as local disk. /data is not backed up; users are responsible for backing up their data and executables to other systems that they have access to through their individual ICs. It does, however, have snapshots taken on a daily and weekly basis. It is assumed that most of the data and programs a user or research group is actively working with will be stored in /data. It is NOT archival or permanent storage for research data. As a rule of thumb, data should be migrated from /data to an archival solution provided by the IC after the publication of results based on that data.
/lscratch is scratch space available on the local SSD of each compute node. As such, the amount available varies from node to node, but it should be at least 400 gigabytes. By default, no /lscratch space is allocated to a SLURM job; it must be requested in the batch submission with a gres option. /lscratch space is allocated on a per-job basis: each job’s /lscratch subdirectory has a quota, so an allocation is guaranteed its requested space, and it cannot be shared between users or jobs. In addition, the /lscratch subdirectory for a job is automatically deleted when the job completes in order to make the space available for the next job. It is the user’s responsibility to stage files to and from /lscratch at appropriate times during the execution of their job.
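As a sketch of that staging pattern (the file names, sizes, and resource amounts here are hypothetical, and this fragment only runs under SLURM on the cluster), a batch script that requests 50 gigabytes of /lscratch, stages data in, and copies results back out before the per-job directory is deleted might look like:

```shell
#!/bin/bash
#SBATCH --gres=lscratch:50
#SBATCH --cpus-per-task=4
#SBATCH --mem=8g

# stage input from /data into the per-job scratch directory
cp /data/$USER/input.bam /lscratch/$SLURM_JOBID/
cd /lscratch/$SLURM_JOBID

# ... run the actual analysis here ...

# copy results back to /data before the job ends and
# /lscratch/$SLURM_JOBID is automatically removed
cp results.txt /data/$USER/
```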
/scratch is a large filesystem that is only mounted on the head nodes. Users do not have a directory on it by default, but they may create subdirectories with arbitrary names under /scratch. Its primary purpose is to allow sharing large data sets between Biowulf users. Users are allowed to use up to 10 terabytes of storage on /scratch, but all files are deleted after 10 days, or sooner if the filesystem reaches 90% of capacity at any point. /scratch is NOT mounted on the compute nodes, and users should not attempt to run jobs out of it.
/tmp is the directory traditionally used in Linux to hold temporary or scratch files created in the course of a program’s run. Users running codes that need more than a nominal amount of /tmp space should request that /lscratch space be allocated for their jobs and set the $TMPDIR environment variable to the appropriate subdirectory of /lscratch. Most of the free disk space on the compute nodes is allocated to /lscratch, so users who use /tmp instead of /lscratch and fill /tmp can crash compute nodes, potentially causing all the jobs running on that node to be lost. If this happens repeatedly, your access to Biowulf may be suspended until you work with one of the support scientists to move your scratch space needs to another directory. /tmp and /lscratch have comparable performance.
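The $TMPDIR redirect can be sketched as follows. Inside a job with /lscratch allocated, you would `export TMPDIR=/lscratch/$SLURM_JOBID`; the snippet below demonstrates the same mechanism locally with a stand-in directory, since well-behaved tools such as mktemp honor $TMPDIR:

```shell
# Inside a Biowulf job with /lscratch allocated, you would run:
#   export TMPDIR=/lscratch/$SLURM_JOBID
# Here a local temporary directory stands in for the per-job
# /lscratch subdirectory to show the effect.
demo=$(mktemp -d)          # stand-in for /lscratch/$SLURM_JOBID
export TMPDIR=$demo
scratch_file=$(mktemp)     # created under $TMPDIR, not /tmp
echo "$scratch_file"
```

Programs that honor $TMPDIR will now write their scratch files to node-local SSD instead of the much smaller /tmp.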
There are more applications installed on Biowulf than can
be easily listed in a short orientation document such as this one. They are
added to user environments and accessed through the module
command. Basic documentation for most of the installed applications can
be found through the search box of https://hpc.nih.gov. If you want to browse what is
installed in a more organic manner, you can look at the module files in
/usr/local/lmod/modulefiles, use the module spider command,
or examine the actual install directories in /usr/local/apps. If the
program you want to use isn’t installed, and you aren’t certain about
how to install it, contact the HPC staff and we will be glad to take a
look at your options for the application.
To access information about your Biowulf account, including what
percentage of disk quota is in use and the status of submitted and
completed jobs, the HPC dashboard is available at https://hpc.nih.gov/dashboard. It will show which /data
directories you have access to, as well as the resources being consumed
by running and completed jobs. It is also the easiest way to reenable
your HPC account (if it is locked because it hasn’t been used in 60
days), or to request more space for one of your /data directories. Using
the dashboard is also an effective way to determine why a job failed or
to help tune the resource requests in a SLURM submission to ensure a job
will be able to run to completion without excessive over-consumption of
resources. If you prefer to work in a command-line environment, the job
accounting data is available on Biowulf as well, via the
dashboard_cli command. Running the command with no options
will print help for using it.
Accessing Biowulf is possible via several methods, but it is important to note that access is only possible through the NIH network, either by being connected to the network on campus, or by using the VPN from off campus. Access is not possible from the “NIH Guest” WiFi SSID, even though it is on the NIH campus.
The easiest way to access Biowulf is through the ssh protocol. From
the Terminal on a Mac, or from either the Command Prompt or PowerShell
on a Windows system, simply type ssh
$username@biowulf.nih.gov. After entering your NIH
login password (which will not echo), a session will begin on the Biowulf
head node that lasts until you log out. From this session, you can
create and submit batch jobs or start interactive sessions on the
compute nodes. Strictly speaking, including your username may not be
required if your local account’s username is the same as your NIH AD
username, but on most Windows systems on campus the string ‘NIH\’ will
be automatically prepended to your username, which will break the login
process. For Windows users, we actually recommend the PuTTY
suite of programs, which offers a better interface to the ssh protocol
than the raw ssh client provided with Windows. If it is not installed on
your system, open a support ticket with your IC’s local desktop support
organization.
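A small entry in ~/.ssh/config (the username below is a placeholder for your own NIH AD username) sidesteps both the ‘NIH\’ prefix problem and typing the full hostname:

```
# ~/.ssh/config -- 'yourNIHusername' is a placeholder
Host biowulf
    HostName biowulf.nih.gov
    User yourNIHusername
```

With this in place, `ssh biowulf` is all that is needed to log in.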
For more complex access needs, such as GUI-based applications, we provide Open On-Demand, which offers a full graphical desktop session in the convenience of your local web browser. To use Open On-Demand, navigate to https://hpcondemand.nih.gov and follow the prompts to start a session. Please note that you will need your PIV card to log into Open On-Demand. For more assistance with Open On-Demand, please send your questions or issues to staff@hpc.nih.gov.
There are also multiple protocols possible for file
transfers to and from Biowulf. The most important thing to remember about
these methods is that most of them need to communicate with helix.nih.gov
and not biowulf.nih.gov. scp, sftp, and
rsync all work to transfer files as expected, with
scp most useful for one or two files at a time,
sftp for a few files, and rsync working
efficiently on moderately-sized directory trees. Please note that
attempting to connect to biowulf.nih.gov with scp or
sftp will fail, and attempting to use rsync
with it will fail after about 5 minutes of file transfer.
For large-scale file transfers (more than about 10 gigabytes), we recommend the use of Globus, as it is built to be very robust with regards to interrupted or slow connections. It can also be used to share data with collaborators who are not at NIH or who don’t have Biowulf accounts. There is more detail available about Globus at https://hpc.nih.gov/docs/globus. Finally, we offer hpcdrive, which is a mechanism to mount filesystems from Biowulf to local user workstations. This is most appropriate for quickly transferring small files or directories to a local workstation from Biowulf, or vice versa. It is also a useful mechanism to allow click and read access to HTML formatted reports generated by applications on Biowulf. More information about hpcdrive can be found at https://hpc.nih.gov/docs/hpcdrive.html .
There are a small number of access methods for Biowulf that we
explicitly recommend against using.
The first of these is the use of FileZilla for file transfer. There has
been a problem with malware being bundled with the Mac and Windows
downloads of FileZilla, so we strongly encourage our users to stay far
away from it for security’s sake. We also discourage users from using
WinSCP to access the shell on Biowulf or Helix. While
it is possible, the interface is very crude, and it leaves no way to use
text-based user interfaces. It’s a useful party trick if you know how to
work in ed, but beyond that, it’s a clunky way to set up
your runs. Finally, we don’t recommend the use of MobaXterm on Windows
hosts as a primary X Windows interface to Biowulf. There are a few edge
cases where it can be helpful, but we do not have the staffing to
support it as a general solution. If you are not very experienced in
working with X as your primary window manager, we recommend avoiding
it.
The HPC staff will email you if we notice problematic behavior, such as
repeatedly restarting an rsync after the Biowulf head node process killer
terminates it. If these emails are ignored or not acted upon, your
account may be locked or your ability to submit SLURM jobs may be
interrupted until we establish communications with you and you indicate
that you will not continue with the problematic behavior. It is worth
noting that we get your email address from NED when you apply for a
Biowulf account, but it is not automatically mirrored from NED after
that time. If you change your NIH work email address, please contact us
to make certain that email from your SLURM runs or the HPC staff will
still be able to find you. Finally, when using the
--mail-user option to sbatch, please make sure that your
correct email is included in the option. It should not be set to
user@nih.gov or anything other than your nih.gov email address. Please
don’t set this option untested and then submit a large swarm or job
array. If it is not set correctly, the HPC staff will get at least one
bounce message for every email you tried to send, which tends to hammer
our mail server.

freen provides information about
free resources on Biowulf, and batchlim shows the limits
placed on different batch queues. Both can be run together via the
bwulf meta-command, or individually by their short names. For more
detail on these commands, run bwulf -h.

Use /lscratch for scratch files: /lscratch is node-local scratch space
that is requested as part of an sbatch or sinteractive command and that is
automatically cleaned up at the end of a job. It is allocated on a
per-job basis as /lscratch/$SLURM_JOBID, and cannot be
shared between users or subjobs. To allocate /lscratch space for a job,
use the --gres=lscratch:## flag to your job submission,
where ## is the amount of disk space you need, in
gigabytes. Please note that you will need to manually stage any data
that you want to process in your /lscratch directory from another
location, such as your /data directory, and change to the allocated
directory before starting your processing.

Use swarm instead of job arrays: Job
arrays are a convenient way to keep several related subjobs together in
a single SLURM submission but running them requires a certain level of
shell scripting expertise. We have implemented a SLURM utility called
swarm that allows you to run larger arrays of jobs by just
entering the commands for each subjob on separate lines in a single
file. It is an easier way to set up job arrays if your shell scripting
is rusty. See the swarm man page for more details or go to
https://hpc.nih.gov/apps/swarm.html.

If you need to track down what is consuming your disk quota, the
dust command will provide a tree view of which
directories are the likely culprits.

Be gentle with the SLURM scheduler: its view of job state is only
refreshed at a fixed interval, so the output of squeue won’t be updated
more often than that. Similarly, trying to launch multiple jobs all at
once (such as with a shell loop over a globbed set of batch scripts) can
cause stability problems with the scheduler. We request that you either
use swarm to launch jobs with multiple sub-jobs or that you
submit no more than 1 job per second with sbatch. Likewise,
please do not run squeue more than once every two minutes,
and avoid running it inside the watch command to monitor
its output. Instead, consider using the
sjobs command to monitor your jobs. It provides the same
information as squeue, but it gets its data from the
dashboard server, which doesn’t place additional strain on the SLURM
scheduler.

Go easy on the head node: even a seemingly innocuous awk command or python script can
easily take up a big chunk of memory or I/O and leave other users with
unresponsive prompts. We have seen a single samtools view
be enough to cause problems at times. Therefore, we ask that you limit
your work on the Biowulf head node to light editing, code compilation,
job submission, and reading output files. This is enforced by a process
killer script that terminates any process not owned by root that
accumulates 5 minutes of CPU time. Additionally, users cannot use more
than 4 CPUs at a time on the head node. We have a second host, Helix,
that mounts the same filesystems as Biowulf and that is intended for
large-scale or interactive data transfer jobs and intensive file
manipulations. wget or other download processes can be run
there inside a screen or tmux session to make
downloading large file sets from Internet hosts easier. We do request
that downloads be limited to no more than 6 at a time in parallel.

Use sinteractive instead of salloc and srun: The
traditional way to request an interactive allocation of CPU and memory
on a SLURM cluster is to use salloc followed by
srun to start an interactive shell on one or more of the
compute nodes. In our experience, this process can be confusing for
users and is prone to errors. We have developed a simpler script, called
sinteractive, that combines the actions of these two
programs into a single command and which simplifies the management of
interactive sessions. Using salloc and srun to
obtain interactive sessions is not supported on Biowulf; we strongly
urge you to use sinteractive instead. Please note that
there is a hard limit of 2 interactive sessions per user on Biowulf, no
matter how the sessions are obtained.

Use our sbatch and salloc wrappers: it is easy to construct a resource
request with the sbatch command which the scheduler will accept, and
happily refuse to ever run. To address these issues and to simplify our
logging processes, we have put wrappers around sbatch and
salloc. Please don’t try to run the raw binaries instead.
If the wrappers are interfering with a workflow tool you are using,
please let us know and we will work with you to adapt the wrappers.

Don’t use sudo or su:
We’ve all seen application installation instructions that tell you to
run sudo to install a requisite library or the final
application to a system directory. Please don’t attempt to do this on
Biowulf. We don’t grant general user access to these commands, and if
you try to use them, it will trigger alarms that our security team will
have to investigate. We understand that you just want the latest version
of the latest tool installed, and you think that you are helping by
trying to do the installation yourself rather than putting in a ticket
for it. Thank you for the thought, but if the installation instructions
involve a sudo or su call, and you don’t know
how to work around it, please just put in a ticket and we will help deal
with it. Similarly, users don’t have write access to /usr/local. Any
attempts to write software to this filesystem will fail.

When developing a swarm, start with small jobs and work your way up to
the full-size problem you want to run. When a swarm fails, please stop
and consider what went wrong before diving in, making a quick fix, and
resubmitting it. If the problem was in a single line of the swarmfile or
in a common run script that all lines reference, please don’t resubmit
the swarm to test your fix. Test the fix with a single sbatch
submission, and only then, if the fix works, resubmit your swarm.
Only run debugging jobs in swarm if the error or problem is
coming from the swarm program directly, such as problems
with bundling or resource allocation. Even then, please simplify the
swarmfile to the fewest number of tasks or the simplest possible job
script that allows you to reproduce the problem. Also, while it may be
tempting to use swarm for all your job submissions, please
don’t use it to submit single-subjob runs. Yes, it may work, but it only
makes it harder to debug things when they go wrong.

The easiest way to get more information about the Biowulf cluster is to check out our website at https://hpc.nih.gov. While we don’t have documentation for every question, our goal is to put as much information up on that site as is possible. If your question isn’t answered there, or if you know that it is so specific as to be a one-off kind of question, please feel free to contact the group’s staff at staff@hpc.nih.gov. We have a rotation schedule to make sure that someone is always covering that email address; reaching out to individual team members with problems runs the risk of a message sitting unseen in their mailbox if they are out of the office or otherwise unavailable. Mailing the staff email address also helps us route the ticket to the staff member with the most relevant experience to help you. Please also do not initiate contact for a problem with us in Teams. Our workflow requires that a ticket be generated for each contact so that we can track issues with Biowulf, and dropping an initial request in Teams means that we will need to create a ticket manually. We are not averse to Teams consultations to see particular problems “in the wild,” but these are best scheduled after initial intake and triage for a given problem. Finally, we do have an open Virtual Walk-in consult via Teams every month. Information about this event is mailed to all Biowulf users in advance.
Resource                      URL/Email
Main Website                  https://hpc.nih.gov
Support Email                 staff@hpc.nih.gov
HPC Training Information      https://hpc.nih.gov/training/
SSH Login Host                biowulf.nih.gov
Open On-Demand                https://hpcondemand.nih.gov
File Transfer Host            helix.nih.gov
Globus Documentation          https://hpc.nih.gov/docs/globus
hpcdrive Documentation        https://hpc.nih.gov/docs/hpcdrive.html
Item                          Limit
/home quota                   16 GB (cannot be increased)
/data quota                   100 GB default (expandable on request)
/scratch quota                10 TB
/scratch retention            10 days maximum
Max job run-time              10 days (unlimited partition can be more)
Interactive sessions          max 2 per user
Head node CPU time limit      5 minutes per process before termination
Max CPUs on head node         4
Parallel downloads on helix   6 maximum