SLURM Basics: Running Jobs
Introduction:
The Katahdin cluster uses the SLURM Resource Manager and Scheduler. Below is some basic information. More to come!
Slurm Basics:
One way to submit jobs is to create a SLURM script that you submit to SLURM with the "sbatch" command. Here is a sample job script:
#!/bin/bash
#SBATCH --job-name=my_job_name # Job name
#SBATCH --partition=haswell # Partition/Queue name
#SBATCH --mail-type=END,FAIL # Mail events
#SBATCH --mail-user=email@maine.edu # Where to send mail
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Run with 4 threads
#SBATCH --mem=60gb # Job memory request
#SBATCH --time=24:00:00 # Time limit hrs:min:sec
#SBATCH --output=test_%j.log # Standard output and error log
module load module_name ... # Load the software modules your program needs
srun program param1 ... # Launch the program under SLURM
The ntasks and cpus-per-task values change depending on whether you have a multi-threaded program (one program that uses multiple threads) or a multi-process program (multiple processes, as with MPI). In the MPI case, you would set --ntasks=50 to run with 50 processes. Other directives control how the job is laid out across nodes:
#SBATCH --ntasks-per-node=4 # Run 4 tasks on each node
#SBATCH --nodes=2 # Run on 2 nodes
This combination would be appropriate, for example, for an MPI job running with 8 processes (4 on each of 2 nodes).
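Putting it together, a minimal MPI-style batch script using these directives might look like the following sketch. The job name, module name, and program name (mpi_program) are placeholders to be replaced with your own:
#!/bin/bash
#SBATCH --job-name=mpi_test # Job name (placeholder)
#SBATCH --partition=haswell # Partition/Queue name
#SBATCH --nodes=2 # Run on 2 nodes
#SBATCH --ntasks-per-node=4 # 4 MPI processes per node (8 total)
#SBATCH --cpus-per-task=1 # One core per MPI process
#SBATCH --mem=8gb # Memory request per node
#SBATCH --time=01:00:00 # Time limit hrs:min:sec
#SBATCH --output=mpi_test_%j.log # Standard output and error log
module load module_name ... # Load your compiler/MPI modules here
srun mpi_program # srun launches all 8 MPI processes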
Useful SLURM Commands
sbatch: Command to submit a job:
sbatch script-name
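For example, if the sample script were saved as go.slurm (the filename is up to you), submission looks like this, with SLURM reporting the new job's ID (962658 is simply the ID used in the examples further down):
sbatch go.slurm
Submitted batch job 962658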
Note that the email directives (--mail-type and --mail-user) in the sample script are optional.
squeue: Command to check all jobs in the queue:
squeue
or to check just the jobs belonging to a particular user:
squeue -u user-name
Another form of this command is just "sq", which does the same thing in a slightly different format. The main additional information it provides is the total number of cores for each job, in the second-to-last column.
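A few other standard squeue options can narrow the listing; for example:
squeue -j JOB_ID # Show only the specified job
squeue -p haswell # Show only jobs in the haswell partition
squeue --start # Show estimated start times for pending jobs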
sinfo: Command to get the status of all of the Slurm partitions:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up infinite 1 mix node-153
haswell* up infinite 4 down* node-[127-129,139]
haswell* up infinite 15 mix node-[55-58,61,81,90,122-123,125-126,130,140-142]
haswell* up infinite 67 idle node-[59,63-80,82-89,91-121,124,131-138]
haswell-test up infinite 1 idle node-62
skylake up infinite 1 drain* node-148
skylake up infinite 4 mix node-[143-144,149-150]
skylake up infinite 3 idle node-[145-147]
dgx up infinite 1 mix dgx
grtx up infinite 1 idle grtx-1
epyc up infinite 4 mix node-[151,153,155,158,163]
epyc up infinite 1 alloc node-152
epyc up infinite 7 idle node-[154,156-157,159-162,164]
epyc-hm up infinite 2 alloc node-[167-168]
epyc-hm up infinite 2 idle node-[169-170]
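sinfo also accepts standard options to narrow or expand this view; for example:
sinfo -p epyc # Show only the epyc partition
sinfo -N -l # Long, node-oriented listing (CPUs, memory, and state for each node)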
scancel: Delete a job:
scancel JOB_ID
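scancel also accepts standard SLURM selectors if you want to cancel more than one job at a time; for example:
scancel -u user-name # Cancel all of your own jobs
scancel --name=my_job_name # Cancel jobs by job name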
checkjob: Checking on the status of a job:
checkjob JOB_ID
This command mimics the command of the same name from the Moab scheduler that we previously used. Sample output:
[abol@katahdin pi_MPI]$ checkjob 962658
JobId=962658 JobName=parallel_pi_test
UserId=abol(1028) GroupId=abol(1003) MCS_label=N/A
Priority=10010 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=2:0
RunTime=00:00:10 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2022-09-14T12:29:08 EligibleTime=2022-09-14T12:29:08
AccrueTime=2022-09-14T12:29:08
StartTime=2022-09-14T12:29:08 EndTime=2022-09-14T12:34:08 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2022-09-14T12:29:08
Partition=haswell AllocNode:Sid=katahdin:9998
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node-[140-141]
BatchHost=node-140
NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=2G,node=2,billing=8
Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
Nodes=node-[140-141] CPU_IDs=16-19 Mem=0 GRES_IDX=
MinCPUsNode=4 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/abol/pi_MPI/go.slurm
WorkDir=/home/abol/pi_MPI
StdErr=/home/abol/pi_MPI/parallel_pi_962658.log
StdIn=/dev/null
StdOut=/home/abol/pi_MPI/parallel_pi_962658.log
Power=
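The detail shown above is the same style of output that the standard SLURM command scontrol prints, so the following works on any SLURM cluster as well:
scontrol show job JOB_ID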
seff: Command to check the memory and CPU efficiency of a job. This command is mostly useful after a job has completed:
seff JOB_ID
for instance:
[root@katahdin slurm]# seff 962658
Job ID: 962658
Cluster: katahdin
User/Group: abol/abol
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:32:08 core-walltime
Job Wall-clock time: 00:04:01
Memory Utilized: 508.00 KB
Memory Efficiency: 0.02% of 2.00 GB
tail -f: Since you remain logged into Katahdin while your jobs run on the compute nodes, you can check on a running job by watching its output file as it is written:
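For example, with the --output=test_%j.log directive from the sample script at the top of this page and a job ID of 962658, that would be:
tail -f test_962658.log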
Partitions:
In Slurm, the term "Partition" refers to a set of nodes. Other Resource Managers refer to these as Queues. Currently, the list of partitions on the Katahdin cluster is:
debug: General debugging of code. Currently just a single node.
haswell: The largest partition in terms of nodes and cores. Around 90 nodes, each with Intel Haswell or Broadwell CPUs with either 24 or 28 cores and 64 GB or 128 GB of RAM.
skylake: 8 Intel Skylake nodes, each with 36 cores and 256 GB of RAM.
dgx: A single Nvidia DGX A100 node with two AMD Epyc2 CPUs (128 cores / 256 threads), 1 TB of RAM, and 8 Nvidia A100 GPUs, each with 40 GB of GPU RAM.
grtx: A single node with one AMD Epyc2 CPU (32 cores), 768 GB of RAM, and 8 Nvidia RTX 2080Ti GPUs, each with 11 GB of GPU RAM.
epyc: A newer partition for the AMD EPYC3 nodes. These 14 nodes each have 96 cores and 512 GB of RAM.
epyc-hm: These four nodes have AMD EPYC3 CPUs, with 32 cores and 1 TB of RAM per node.
The list of partitions, along with the current state of the nodes in each partition, can be retrieved with the "sinfo" command. For instance:
[cousins@katahdin ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain node-60
haswell up infinite 7 down* node-[52-54,127-129,139]
haswell up infinite 1 drain node-57
haswell up infinite 23 mix node-[56,80,82-87,94-95,112,115,126,130-138,142]
haswell up infinite 5 alloc node-[55,58-59,61,81]
haswell up infinite 53 idle node-[63-79,88-93,96-111,113-114,116-125,140-141]
haswell-test up infinite 1 drain node-62
himem up infinite 1 down* node-51
skylake up infinite 1 drain* node-148
skylake up infinite 1 mix node-149
skylake up infinite 6 idle node-[143-147,150]
dgx up infinite 1 mix dgx
grtx up infinite 1 idle grtx-1
testing up infinite 4 alloc node-[151-154]
Getting more details about a job:
The checkjob command can be used to get more information about a job:
checkjob $JOB_ID
where $JOB_ID is the ID for the job that you want information about.
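One convenient pattern is to capture the job ID at submission time with sbatch's standard --parsable option (which prints just the ID) and reuse it; go.slurm is the example script name from above:
JOB_ID=$(sbatch --parsable go.slurm)
checkjob $JOB_ID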
Interactive Jobs:
In general, you submit jobs to SLURM and the job gets sent to a node or set of nodes and runs in the background. That is, after you submit the job, you are returned to the shell prompt and can continue with what you were doing. Occasionally, you might want to interact with the job directly on the node where it is running. You can do this with the srun command, like this:
srun --partition=grtx --ntasks=1 --cpus-per-task=4 --mem=64gb --gres=gpu:1 --time=10:00:00 --pty /bin/bash
This command requests an interactive job in the "grtx" partition with 4 CPU cores, one GPU, and 64 GB of RAM for 10 hours. Once the resources are available, your terminal is connected to the node running the job and you are placed at a shell prompt there. From there, you can run commands. This is particularly helpful when you want to use the nvcc command on a GPU node to compile CUDA code.
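For instance, once the interactive shell opens on the GPU node, a CUDA compile session might look like the following sketch; the module name and source file are placeholders, so check "module avail" for the actual CUDA module on Katahdin:
module load cuda # Placeholder name; use the CUDA module shown by "module avail"
nvcc my_kernel.cu -o my_kernel # Compile a CUDA source file (placeholder name)
./my_kernel # Run it on the node's GPU
exit # Leaving the shell ends the interactive job and releases the resources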