High Performance Computing
Overview of the Katahdin HPC system:
The main High Performance Computing resource that the UMS Advanced Computing Group provides is a growing set of computers (nodes) and storage systems linked together with a high-speed, low-latency InfiniBand network. The general term for this type of system is a "Compute Cluster", made up of a Head Node, Login Nodes, Compute Nodes, and a Storage System. Compute nodes generally have large numbers of CPU cores and large amounts of memory. Programs can be run on individual cores of a single node, or they can run across multiple nodes using hundreds of cores at the same time. There are also two General Purpose GPU nodes, each with eight GPUs that can be used together by one program or individually by separate programs.
All of the compute nodes are managed by a central Resource Manager and Scheduler called SLURM. SLURM keeps track of all of the resources of the cluster and the jobs that run on it. In order to run a program on the cluster, a "job" needs to be submitted. A "job" is a request for resources (for instance, what type of nodes to use, how many nodes, how many CPUs, how much memory, and for how long these are needed), along with a command to run a program that will use those resources. The different types of nodes are grouped into SLURM entities called "partitions".
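As a quick sketch of what this looks like in practice, SLURM's standard query commands show the partitions and the current job queue (the exact partition names and output depend on the cluster):

```shell
# List the partitions and the state of the nodes in each one
sinfo

# Show your own queued and running jobs ($USER expands to your username)
squeue -u $USER
```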
Penobscot HPC system (beta):
The Penobscot HPC system is our next-generation cluster. The most noticeable difference between Penobscot and Katahdin is how you can interact with the cluster. While traditional terminal-based SSH connections are still possible, Penobscot also runs Open OnDemand, which allows you to do everything from within a web browser. This system also has our three newest GPU nodes with 9 new GPUs. We will be transitioning nodes from the Katahdin cluster to Penobscot. The system is still being tested, but we welcome anyone willing to help with that testing by using the system. The basic software is installed, still using the same "module" program to manage environments, and we will be adding software regularly.
Types of nodes and their SLURM partitions:
There are a few types of nodes/partitions available:
CPU Nodes (Katahdin only unless Penobscot is specified):
AMD EPYC 3 Milan CPUs with partition name: epyc
Thanks to an NSF CC* grant (NSF award 2018851), 14 nodes, each with 512 GB of RAM and 96 cores running at 2.3 GHz, have been acquired and are currently available for use. They are connected by InfiniBand HDR100 at 100 Gbps.
Total of 1344 cores and 7 TB of RAM
NEW: AMD EPYC 3 Milan CPUs with High Memory with partition name: epyc-hm
Also acquired as part of the NSF CC* grant noted above, these four nodes each have 1 TB of RAM and 32 cores running at 3.0 GHz. They are connected by InfiniBand HDR100 at 100 Gbps.
Total of 128 cores and 4 TB of RAM
Intel Skylake nodes: partition name: skylake
8 nodes, each with 36 cores and 256 GB of RAM
Total of 288 cores and 2 TB of RAM
Intel Haswell nodes: partition name (both Katahdin and Penobscot): haswell
88 nodes, each with either 24 or 28 cores and either 64 GB or 128 GB of RAM
GPU Nodes: total of 25 GPUs in 5 systems
Nvidia DGX A100 node: partition name: dgx
This is an Nvidia DGX A100 node with 8 A100 GPUs (40 GB of GPU RAM each), 128 cores/256 threads using AMD EPYC 2 Rome CPUs, and 1 TB of RAM.
Gigabyte node with Nvidia RTX 2080 Ti GPUs: partition name: grtx
This machine has 8 Nvidia RTX 2080 Ti GPUs, a single 32-core AMD EPYC 1 Naples CPU, and 768 GB of RAM.
Penobscot only GPUs: Three Dell nodes with three different types of GPUs:
node-g101: Penobscot partition gpu, type a100
2 Nvidia A100 GPUs, each with 80 GB of GPU RAM
AMD EPYC Milan CPUs with 32 cores running at 3.0 GHz
512 GB RAM
node-g102: Penobscot partition gpu, type l40
3 Nvidia L40 GPUs, each with 48 GB of GPU RAM
AMD EPYC Milan CPUs with 32 cores running at 3.0 GHz
512 GB RAM
node-g103: Penobscot partition gpu, type a30
4 Nvidia A30 GPUs, each with 24 GB of GPU RAM
Intel Xeon Gold CPUs with 32 cores running at 2.9 GHz
512 GB RAM
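If the GPU types on the Penobscot gpu partition are exposed as SLURM generic resources under those type names (an assumption based on the descriptions above; check `sinfo` or ask the ACG for the exact GRES names), a specific GPU model can be requested like this:

```shell
# Request one A100 GPU on the "gpu" partition for an interactive shell.
# The "a100" GRES type name is an assumption based on the node list above.
srun --partition=gpu --gres=gpu:a100:1 --pty bash
```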
How to use the HPC resources:
Almost all of the HPC systems that the ACG operates run Linux, and the main interface to the cluster is through an SSH (Secure Shell) connection to one of the Login Nodes. This provides a command-line interface to manage programs on the cluster.
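A typical login looks like the following; the hostname here is only a placeholder, so substitute the login-node address provided with your cluster account:

```shell
# Replace both the username and the hostname with your own details
ssh yourusername@login.hpc.example.edu
```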
Another way to interact with the cluster is using VNC Desktops. Each cluster account can set up a VNC Server, which acts like a Linux Desktop in a window. In order to connect to a VNC Server, an SSH connection is still required. The SSH connection is used to "tunnel" the VNC communication, thus providing a secure, encrypted connection. By using the VNC desktop, graphical programs can be run easily and with good performance. VNC also provides a persistent interface, so you can disconnect and later reconnect, and all windows and programs will be just as you left them.
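A minimal sketch of tunneling VNC over SSH, assuming your VNC server runs on display :1 of the login node (the hostname is again a placeholder and the display number may differ on your account):

```shell
# Forward local port 5901 to port 5901 on the login node.
# VNC display :1 listens on TCP port 5901 (5900 + display number).
ssh -L 5901:localhost:5901 yourusername@login.hpc.example.edu

# While that SSH session stays open, point your VNC client
# at localhost:5901 on your own machine.
```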
Open OnDemand (Penobscot):
Open OnDemand makes all cluster interaction possible through a web browser: log in to the login node(s), manage files, submit jobs, and run interactive programs such as RStudio, Jupyter, MATLAB, and a terminal directly on nodes (including GPU nodes). You can even bring up a graphical desktop to help with development and debugging. All of this happens within the browser, with no need to install software or set up tunnels.
SLURM Job Scheduler:
Programs can be run on the nodes by running a command to submit a "job script" to the SLURM scheduler. SLURM keeps track of all of the resources (e.g. CPU, GPU, memory) on all of the nodes, as well as all of the programs being run on them. When new jobs are submitted, it is up to the SLURM scheduler to figure out which nodes should run each program and to start it when the requested resources are available.
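A minimal job script might look like the following sketch. The partition name epyc comes from the list above; the job name, resource numbers, and program name are just illustrative examples:

```shell
#!/bin/bash
#SBATCH --job-name=my_test        # A name for the job
#SBATCH --partition=epyc          # One of the partitions listed above
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks=4                # Number of CPU cores (tasks)
#SBATCH --mem=8G                  # Memory for the whole job
#SBATCH --time=01:00:00           # Wall-clock time limit (HH:MM:SS)

# Everything below runs on the allocated compute node
hostname
srun ./my_program
```

Saved as, say, job.sh, the script is submitted with `sbatch job.sh`; SLURM prints a job ID, and `squeue` shows the job's state while it waits and runs.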
A wide range of software has been compiled or installed on the HPC system, including scientific packages, libraries, and compilers; some packages even have multiple versions installed. To make it easy to manage which packages are used, either interactively on the Login node or in a job submitted to SLURM, the software is managed by the Lmod Environment Module system. The "module" program can be run in a terminal to manage which software is currently active in that terminal/shell environment. Since compilers are provided, people can also compile and install programs and packages into their own accounts. Similarly, since the Anaconda Python system is provided as a module, people can create their own Anaconda virtual environments and install Python packages into them. This puts full control into each user's hands.
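A typical session with modules and a personal Anaconda environment might look like this sketch; the module name anaconda3 and the environment details are assumptions, so check `module avail` for what is actually installed on the cluster:

```shell
# See which software modules are available, then load one
module avail
module load anaconda3     # module name is an assumption; check "module avail"

# Create a personal environment in your own account and install packages into it
conda create -n myproject python=3.11 numpy
conda activate myproject

# See which modules are currently loaded in this shell
module list
```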
Containers offer a way to install and run non-standard, complex, or hard-to-install programs on a wide range of hardware and Operating Systems. A container encapsulates everything that is needed for the program to run into a single file. That file can then be copied to any system that can run containers, even if that system is running a completely different Operating System or Linux distribution.
The ACG HPC systems use Singularity to run containers. Singularity is compatible with Docker, so a huge range of pre-packaged software is available for use. For instance, Nvidia provides a large catalog of Docker containers offering a wide range of software optimized to run on their GPU systems. We have converted some of these containers into Singularity images and made them available as software modules on our systems.
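For example, pulling a Docker image from a registry and running a command inside it looks like this (the image name is only illustrative):

```shell
# Convert a Docker Hub image into a local Singularity image file (.sif)
singularity pull docker://ubuntu:22.04

# Run a command inside the container; it sees the container's own
# Linux distribution rather than the host's
singularity exec ubuntu_22.04.sif cat /etc/os-release
```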