Experienced User Guide for Biowulf

"To develop software, you must understand the problem as well as the desired solution. If you don’t understand the solution, you’re doing research. If you don’t understand the problem, you’re just mucking about with computers." - Joe Leonard

Introduction

About Biowulf

Hello and thank you for your interest in Biowulf, the High-Performance Computing (HPC) cluster at the National Institutes of Health. Biowulf, with over 90,000 cores, 70 petabytes of storage, and an array of over 900 GPUs, is one of the world’s largest computing clusters devoted to biomedical research. It is available to any researcher in the NIH IRP (Intramural Research Program) who is sponsored by their IRP Principal Investigator and pays the flat $40/month usage fee. A wide variety of installed software supports sequence analysis, structural biology, computational chemistry, mathematical statistics, and image processing. Biowulf is supported by a team of nearly 20 systems administrators and support scientists who ensure the system’s stability and assist users with any questions that arise in the course of their work. The goal of this document is to introduce Biowulf to researchers who already have experience with another HPC cluster of some form. Potential Biowulf users without any particular cluster experience may find the virtual orientations for Biowulf (see https://hpc.nih.gov/training/) a gentler introduction to cluster computing. New users with specific questions are also welcome to contact the HPC staff via email and seek answers directly.

HPC clusters generally come in two varieties. Some, like the Department of Energy clusters at Sandia National Lab or Oak Ridge National Lab, are built to run a single, large program as fast as possible. These clusters are optimized for processor and network speed and are called Capability clusters. Others, like Biowulf, are designed to have as many CPU cores and as much memory as is feasible. Their overall goal is to provide as many compute cycles as possible to as many users as possible per unit time. These are known as Capacity clusters. To use an automotive metaphor, a Capability cluster is like a Ferrari or Maserati (or maybe a VW microbus with a turbocharged Porsche engine) that can get a few people someplace as fast as possible. A Capacity cluster, on the other hand, is more like an 18-wheel tractor trailer. It can still do 75 MPH on the highway, but its raison d’être is to move a large quantity of freight from point A to point B efficiently. The practical upshot of Biowulf being a Capacity cluster is that it is shared between over 3,000 users, with over 500 of them actively running jobs on the compute nodes simultaneously.

Introducing Biowulf

As stated above, Biowulf is a Linux-based, heterogeneous HPC cluster hosted by the NIH Center for Information Technology (CIT) that is available to researchers in the NIH IRP. It is composed of a login (head) node, called biowulf.nih.gov, and thousands of compute nodes, among other things. Almost all the actual computing occurs on the compute nodes; the head node is primarily a gateway to them, not a compute resource itself. Access to the compute nodes is mediated by the SLURM resource manager/batch scheduler. Users cannot connect to a node outside of the batch system unless the batch system has already placed one of their jobs on that particular node. Most batch jobs are allowed to run for up to 10 days by default. Our SLURM configuration defines several partitions, or queues, each with a different purpose.
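As a taste of the workflow, submitting a batch script and checking on it looks like the following sketch. The partition name, resource values, script name, and job ID are all illustrative placeholders, not recommendations:

```shell
# Submit a batch script; sbatch prints the new job's ID
sbatch --partition=norm --cpus-per-task=4 --mem=8g myjob.sh

# List your queued and running jobs
squeue -u $USER

# Cancel a job if needed (12345 is a placeholder job ID)
scancel 12345
```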

SLURM Partitions/Queues

Some of the most important queues are:

Filesystems

On Biowulf, there are several filesystems available to users that fulfill different roles in the Biowulf ecosystem. It is important to note that PII/PHI data is NOT allowed anywhere on Biowulf without consulting with the HPC staff and making special arrangements.

It is also important to note that, for maximum performance, no single directory or subdirectory should contain more than 5000 files. If you need to work with many thousands of files, your jobs will perform best if you create a nested directory structure with no more than 5000 files and directories per subdirectory. Operating on a single directory containing millions of files is one of the surest ways to unnecessarily increase your job’s run-time.
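The recommendation above can be sketched as a small shell loop. The directory and file names here are hypothetical, and the bucket size is shrunk to 3 so the demo is easy to inspect; on Biowulf you would use 5000:

```shell
# Demo setup: create a flat directory of 7 hypothetical sample files.
mkdir -p flat
for k in 1 2 3 4 5 6 7; do touch "flat/sample_$k.txt"; done

# Bucket the files into numbered subdirectories of at most $max entries.
max=3    # demo size; on Biowulf you would use 5000
batch=0
count=0
mkdir -p "flat/batch_$batch"
for f in flat/*; do
    [ -f "$f" ] || continue              # skip the batch_* directories themselves
    if [ "$count" -ge "$max" ]; then     # current bucket is full: start a new one
        batch=$((batch + 1))
        count=0
        mkdir -p "flat/batch_$batch"
    fi
    mv "$f" "flat/batch_$batch/"
    count=$((count + 1))
done
```

After the loop, the 7 files end up spread across flat/batch_0 through flat/batch_2, none holding more than 3 entries.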

The /home filesystem is intended to hold configuration files and short scripts for Biowulf users. It is not particularly fast to access, but it is designed to be highly available. It is strictly limited to 16 gigabytes per user, and this amount cannot be increased. /home is backed up to tape, and snapshots are taken hourly, daily, and weekly.

The /data filesystem, on the other hand, is designed to offer at least 100 gigabytes to each user, which can be increased upon providing a justification to the Biowulf staff. It is made up of high-performance filesystems and is shared between the head nodes and the compute nodes, so it is not as fast as local disk. /data is not backed up; users are responsible for backing up their data and executables to other systems that they have access to through their individual ICs. It does, however, have snapshots taken on a daily and weekly basis. It is assumed that most of the data and programs a user or research group is actively working with will be stored in /data. It is NOT archival or permanent storage for research data. As a rule of thumb, data should be migrated from /data to an archival solution provided by the IC after the publication of results based on that data.

/lscratch is scratch space available on the local SSD of each compute node. As such, the amount available varies from node to node, but it should be at least 400 gigabytes. By default, no /lscratch space is allocated to a SLURM job; it must be requested in the batch submission with a gres option. /lscratch space is allocated on a per-job basis, and each allocation is enforced with a quota on the job’s /lscratch subdirectory, so the space a job requests is guaranteed to be available to it. It cannot be shared between users or jobs. In addition, the /lscratch subdirectory for a job is automatically deleted when the job completes in order to make the space available for the next job. It is the user’s responsibility to stage files to and from /lscratch at appropriate times during the execution of their job.
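A minimal batch script using /lscratch might look like the following sketch, assuming (as on Biowulf) that the gres size is given in gigabytes. The program name, input and output paths, and the 50 GB figure are placeholders, not recommendations:

```shell
#!/bin/bash
#SBATCH --gres=lscratch:50            # request 50 GB of node-local scratch

# Work inside this job's private /lscratch subdirectory
cd /lscratch/$SLURM_JOB_ID || exit 1

# Point scratch-hungry programs here instead of /tmp
export TMPDIR=/lscratch/$SLURM_JOB_ID

# Stage input in, compute, and copy results back before the job ends,
# since /lscratch is deleted automatically at job completion.
cp /data/$USER/inputs/big_input.dat .
my_program big_input.dat > results.out   # my_program is a placeholder
cp results.out /data/$USER/outputs/
```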

/scratch is a large filesystem that is only mounted on the head nodes. Users do not have a directory on it by default, but they may create subdirectories with arbitrary names under /scratch. Its primary purpose is to allow sharing large data sets between Biowulf users. Users are allowed to use up to 10 terabytes of storage on /scratch, but all files are deleted after 10 days, or sooner if the filesystem reaches 90% of capacity at any point. /scratch is NOT mounted on the compute nodes, and users should not attempt to run jobs out of it.

/tmp is a directory that is traditionally used in Linux to hold temporary or scratch files needed in the course of a program’s run. Users whose codes use more than a nominal amount of /tmp space should request that /lscratch space be allocated for their jobs and that the $TMPDIR environment variable be set to the appropriate subdirectory of /lscratch. Most of the free disk space on the compute nodes is allocated to /lscratch, and the two have comparable performance. Users who use /tmp instead of /lscratch and fill /tmp can crash compute nodes, potentially causing all the jobs running on those nodes to be lost. If this happens repeatedly, your access to Biowulf may be suspended until you work with one of the support scientists to move your scratch space needs to another directory.

Software and Modules

There are more applications installed on Biowulf than can be easily listed in a short orientation document such as this one. They are added to user environments and accessed through the module command. Basic documentation for most of the installed applications can be found through the search box of https://hpc.nih.gov. If you want to browse what is installed in a more organic manner, you can look at the module files in /usr/local/lmod/modulefiles, use the module spider command, or examine the actual install directories in /usr/local/apps. If the program you want to use isn’t installed, and you aren’t certain about how to install it, contact the HPC staff and we will be glad to take a look at your options for the application.
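A typical module workflow looks like the following sketch; samtools is used purely as an illustrative application name:

```shell
module spider samtools    # search all module files for a name
module avail samtools     # list versions visible in the current module tree
module load samtools      # add the default version to your environment
module list               # confirm what is currently loaded
module unload samtools    # remove it again
```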

To access information about your Biowulf account, including what percentage of disk quota is in use and the status of submitted and completed jobs, the HPC dashboard is available at https://hpc.nih.gov/dashboard. It will show which /data directories you have access to, as well as the resources being consumed by running and completed jobs. It is also the easiest way to reenable your HPC account (if it is locked because it hasn’t been used in 60 days), or to request more space for one of your /data directories. Using the dashboard is also an effective way to determine why a job failed, or to help tune the resource requests in a SLURM submission so that a job can run to completion without excessive over-consumption of resources. If you prefer to work in a command-line environment, the same job accounting data is available on Biowulf through the dashboard_cli command. Running the command with no options will print a summary of its usage options.

Accessing Biowulf

Connection Methods

Accessing Biowulf is possible via several methods, but it is important to note that access is only possible through the NIH network, either by being connected to the network on campus, or by using the VPN from off campus. Access is not possible from the “NIH Guest” WiFi SSID, even though it is on the NIH campus.

The easiest way to access Biowulf is through the ssh protocol. From the terminal on a Mac, or from either the command prompt or PowerShell on a Windows system, simply type ssh $username@biowulf.nih.gov. After entering your NIH login password (which will not echo), a session will begin on the Biowulf head node that lasts until you log out. From this session, you can create and submit batch jobs or start interactive sessions on the compute nodes. Strictly speaking, including your username may not be required if your local account’s username is the same as your NIH AD username, but on most Windows systems on campus, the string ‘NIH\’ will be automatically prepended to your username, which will break the login process. For Windows users, we recommend the PuTTY suite of programs, which offers a better interface to the ssh protocol than the raw ssh client provided with Windows. If it is not installed on your system, open a support ticket with your IC’s local desktop support organization.
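One way to sidestep the username issues described above is a short entry in the ssh configuration file on your workstation, so that a plain "ssh biowulf" connects with the right account name. The username shown is a placeholder:

```
# Entry for your workstation's ~/.ssh/config file.
# Replace "yourNIHusername" with your NIH AD username (no "NIH\" prefix).
Host biowulf
    HostName biowulf.nih.gov
    User yourNIHusername
```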

For more complex access needs, such as requiring access to GUI-based applications, we provide Open On-Demand, which allows sessions with a full graphical desktop in the convenience of your local web browser. To use Open On-Demand, navigate to https://hpcondemand.nih.gov and follow the prompts to start a session. Please note that you will need your PIV card to log into Open On-Demand. For more assistance with Open On-Demand, please send your questions or issues to the HPC staff by email.

File Transfer

There are also multiple protocols possible for file transfers to and from Biowulf. The most important thing to remember about these methods is that most of them need to communicate with helix.nih.gov and not biowulf.nih.gov. scp, sftp, and rsync all work to transfer files as expected, with scp most useful for one or two files at a time, sftp for a few files, and rsync working efficiently on moderately-sized directory trees. Please note that attempting to connect to biowulf.nih.gov with scp or sftp will fail, and attempting to use rsync with it will fail after about 5 minutes of file transfer.
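A few illustrative transfer commands follow; the username, file names, and paths are placeholders, and note that the target host is helix.nih.gov:

```shell
# One or two files at a time: scp
scp results.tar.gz user@helix.nih.gov:/data/user/

# An interactive session for a few files: sftp
sftp user@helix.nih.gov

# A moderately sized directory tree: rsync
# -a recurses and preserves permissions and timestamps; -v lists each file
rsync -av project_dir/ user@helix.nih.gov:/data/user/project_dir/
```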

For large-scale file transfers (more than about 10 gigabytes), we recommend the use of Globus, as it is built to be very robust with regard to interrupted or slow connections. It can also be used to share data with collaborators who are not at NIH or who don’t have Biowulf accounts. There is more detail available about Globus at https://hpc.nih.gov/docs/globus. Finally, we offer hpcdrive, a mechanism to mount filesystems from Biowulf on local user workstations. This is most appropriate for quickly transferring small files or directories to a local workstation from Biowulf, or vice versa. It is also a useful mechanism to allow click-and-read access to HTML-formatted reports generated by applications on Biowulf. More information about hpcdrive can be found at https://hpc.nih.gov/docs/hpcdrive.html.

Not Recommended

There are a small number of access methods for Biowulf that we explicitly do NOT recommend and advise against using. The first of these is the use of Filezilla for file transfer. There has been a problem with malware getting bundled with Mac and Windows downloads for Filezilla, so we strongly encourage our users to stay far away from it for security’s sake. We also discourage users from using WinSCP as a mechanism of accessing the shell on Biowulf or Helix. While it is possible, the interface is very crude, and it leaves no way to use text-based user interfaces. It’s a useful party trick if you know how to work in ed, but beyond that, it’s a clunky way to set up your runs. Finally, we don’t recommend the use of MobaXterm on Windows hosts as a primary X Windows interface to Biowulf. There are a few edge cases where it can be helpful, but we do not have the staffing to support it as a general solution. If you are not very experienced in working with X as your primary window manager, we recommend avoiding it.

Best Practices

Do's:
Don'ts:
Getting More Information

The easiest way to get more information about the Biowulf cluster is to check out our website at https://hpc.nih.gov. While we don’t have documentation for every question, our goal is to put as much information on that site as possible. If your question isn’t answered there, or if you know that it is so specific as to be a one-off kind of question, please feel free to contact the HPC staff by email. We have a rotation schedule to make sure that someone is always covering that address; reaching out to individual team members with problems runs the risk of sitting unseen in their mailboxes if they are out of the office or otherwise unavailable. Mailing the staff address also helps us route the ticket to the staff member with the most relevant experience. Please do not initiate contact for a problem with us in Teams. Our workflow requires that a ticket be generated for each contact so that we can track issues with Biowulf, and dropping an initial request in Teams means that we will need to create a ticket manually. We are not averse to Teams consultations to see particular problems “in the wild,” but these are best scheduled after initial intake and triage for a given problem. Finally, we do have an open Virtual Walk-in consult via Teams every month. Information about this event is mailed to all Biowulf users in advance.

Quick Reference Data

Important Sites

  Resource                   URL/Email
  Main Website               https://hpc.nih.gov
  Support Email
  HPC Training Information   https://hpc.nih.gov/training/
  SSH Login Host             biowulf.nih.gov
  Open On-Demand             https://hpcondemand.nih.gov
  File Transfer Host         helix.nih.gov
  Globus Documentation       https://hpc.nih.gov/docs/globus
  hpcdrive Documentation     https://hpc.nih.gov/docs/hpcdrive.html
Important Limits

  Item                          Limit
  /home quota                   16 GB (cannot be increased)
  /data quota                   100 GB default (expandable on request)
  /scratch quota                10 TB
  /scratch retention            10 days maximum
  Max job run-time              10 days by default (more on the unlimited partition)
  Interactive sessions          2 per user maximum
  Head node CPU time limit      5 minutes per process before termination
  Max CPUs on head node         4
  Parallel downloads on helix   6 maximum
