SLURM Basics: Running Jobs

Introduction:

The Katahdin cluster uses the SLURM Resource Manager and Scheduler. Below is some basic information. More to come!

Slurm Basics:

One way to submit jobs is to create a SLURM script and submit it with the "sbatch" command. Here is a sample job script:

#!/bin/bash
#SBATCH --job-name=my_job_name        # Job name
#SBATCH --partition=haswell           # Partition/Queue name
#SBATCH --mail-type=END,FAIL          # Mail events
#SBATCH --mail-user=email@maine.edu   # Where to send mail
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --cpus-per-task=4             # Run with 4 threads
#SBATCH --mem=60gb                    # Job memory request
#SBATCH --time=24:00:00               # Time limit hrs:min:sec
#SBATCH --output=test_%j.log          # Standard output and error log

module load module_name ...

srun program param1 ...

The --ntasks and --cpus-per-task values change depending on whether you have a multi-threaded program (one process that uses multiple threads) or a multi-process program (multiple processes, as with MPI). In the MPI case, you would set --ntasks=50 to run with 50 processes. Other directives control how the job is laid out across nodes:

#SBATCH --ntasks-per-node=4           # Run 4 tasks per node
#SBATCH --nodes=2                     # Run on 2 nodes

This configuration would suit, for example, an MPI job running with 8 processes (4 tasks on each of 2 nodes).
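
Putting these directives together, a minimal MPI job script might look like the following sketch (the module name openmpi and the executable mpi_program are placeholders; substitute the actual module and program for your code):

#!/bin/bash
#SBATCH --job-name=mpi_job            # Job name
#SBATCH --partition=haswell           # Partition/Queue name
#SBATCH --nodes=2                     # Run on 2 nodes
#SBATCH --ntasks-per-node=4           # Run 4 tasks per node (8 total)
#SBATCH --time=01:00:00               # Time limit hrs:min:sec
#SBATCH --output=mpi_%j.log           # Standard output and error log

module load openmpi                   # placeholder module name

srun ./mpi_program                    # srun launches all 8 MPI processes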


Useful SLURM Commands

sbatch: Command to submit a job:

sbatch script-name
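
For example, if the sample script above were saved in a file named my_job.slurm (a name chosen here for illustration), Slurm replies with the ID assigned to the new job:

sbatch my_job.slurm
Submitted batch job 962658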

The email directives (--mail-type and --mail-user) in the job script are optional.

squeue: Command to check all jobs in the queue:

squeue

or to check only the jobs belonging to a particular user:

squeue -u user-name
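
The output looks something like this (illustrative, based on the sample job shown later on this page; the default format truncates long job names):

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
962658   haswell parallel     abol  R       0:10      2 node-[140-141]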

Another form of this command is simply "sq", which does the same thing in a slightly different format. The main additional information it provides is the total number of cores for each job, in the second-to-last column.

sinfo: Command to get the status of all of the Slurm partitions:

sinfo


PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug           up   infinite      1    mix node-153
haswell*        up   infinite      4  down* node-[127-129,139]
haswell*        up   infinite     15    mix node-[55-58,61,81,90,122-123,125-126,130,140-142]
haswell*        up   infinite     67   idle node-[59,63-80,82-89,91-121,124,131-138]
haswell-test    up   infinite      1   idle node-62
skylake         up   infinite      1 drain* node-148
skylake         up   infinite      4    mix node-[143-144,149-150]
skylake         up   infinite      3   idle node-[145-147]
dgx             up   infinite      1    mix dgx
grtx            up   infinite      1   idle grtx-1
epyc            up   infinite      4    mix node-[151,153,155,158,163]
epyc            up   infinite      1  alloc node-152
epyc            up   infinite      7   idle node-[154,156-157,159-162,164]
epyc-hm         up   infinite      2  alloc node-[167-168]
epyc-hm         up   infinite      2   idle node-[169-170]
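
In the STATE column, "idle" means the nodes are free, "mix" means some but not all of their CPUs are allocated, "alloc" means they are fully allocated, and "drain" or "down" means they are unavailable for new jobs; a trailing "*" indicates the node is not currently responding.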

scancel: Command to delete a job:

scancel JOB_ID
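
You can also cancel all of your own jobs at once:

scancel -u user-name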


checkjob: Command to check the status of a job:


checkjob JOB_ID

This command mimics the command of the same name in the Moab scheduler, which we used previously. Sample output:

[abol@katahdin pi_MPI]$ checkjob 962658
JobId=962658 JobName=parallel_pi_test
   UserId=abol(1028) GroupId=abol(1003) MCS_label=N/A
   Priority=10010 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=2:0
   RunTime=00:00:10 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2022-09-14T12:29:08 EligibleTime=2022-09-14T12:29:08
   AccrueTime=2022-09-14T12:29:08
   StartTime=2022-09-14T12:29:08 EndTime=2022-09-14T12:34:08 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2022-09-14T12:29:08
   Partition=haswell AllocNode:Sid=katahdin:9998
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node-[140-141]
   BatchHost=node-140
   NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=2G,node=2,billing=8
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
     Nodes=node-[140-141] CPU_IDs=16-19 Mem=0 GRES_IDX=
   MinCPUsNode=4 MinMemoryNode=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/abol/pi_MPI/go.slurm
   WorkDir=/home/abol/pi_MPI
   StdErr=/home/abol/pi_MPI/parallel_pi_962658.log
   StdIn=/dev/null
   StdOut=/home/abol/pi_MPI/parallel_pi_962658.log
   Power=
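
The most useful fields are typically JobState, RunTime and TimeLimit, NodeList, and TRES, which shows the total resources (CPUs, memory, nodes) allocated to the job.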


seff: Command to check the memory and CPU efficiency of a job. This command is mostly useful after a job has completed:


seff JOB_ID


For instance:


[root@katahdin slurm]# seff 962658
Job ID: 962658
Cluster: katahdin
Use of uninitialized value $user in concatenation (.) or string at /bin/seff line 154, <DATA> line 628.
User/Group: /abol
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:32:08 core-walltime
Job Wall-clock time: 00:04:01
Memory Utilized: 508.00 KB
Memory Efficiency: 0.02% of 2.00 GB
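
In this example the job used almost none of its allocated CPU time and memory. Efficiency numbers this low usually mean the job could request fewer resources, or that the program did not run as expected.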



tail -f: Command to watch a job's output file as it is written. Since you are logged into Katahdin while your job runs on a compute node, you can monitor the job's progress by following its output file:

tail -f output-file-name

Partitions:

In Slurm, the term "Partition" refers to a set of nodes; other resource managers call these Queues. The list of partitions on the Katahdin cluster, along with the current state of the nodes in each partition, can be retrieved with the "sinfo" command. For instance:


[cousins@katahdin ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*          up   infinite      1  drain node-60
haswell         up   infinite      7  down* node-[52-54,127-129,139]
haswell         up   infinite      1  drain node-57
haswell         up   infinite     23    mix node-[56,80,82-87,94-95,112,115,126,130-138,142]
haswell         up   infinite      5  alloc node-[55,58-59,61,81]
haswell         up   infinite     53   idle node-[63-79,88-93,96-111,113-114,116-125,140-141]
haswell-test    up   infinite      1  drain node-62
himem           up   infinite      1  down* node-51
skylake         up   infinite      1 drain* node-148
skylake         up   infinite      1    mix node-149
skylake         up   infinite      6   idle node-[143-147,150]
dgx             up   infinite      1    mix dgx
grtx            up   infinite      1   idle grtx-1
testing         up   infinite      4  alloc node-[151-154]



Interactive Jobs: 

In general, you submit jobs to SLURM, the job gets sent to a node or set of nodes, and it runs in the background. That is, after you submit the job you are returned to the shell prompt and can continue with what you were doing. Occasionally, you might want to interact with the job directly on the node where it is running. You can do this with the srun command, like:


srun --partition=grtx --ntasks=1 --cpus-per-task=4 --mem=64gb --gres=gpu:1 --time=10:00:00 --pty /bin/bash


This command requests an interactive job in the "grtx" partition with 4 CPU cores, one GPU, and 64 GB of RAM for 10 hours. Once the resources are available, the terminal is logged into the node running the job and you are placed at a shell prompt. From there, you can run commands. This is particularly helpful when you want to use the nvcc command on a GPU node to compile with CUDA.
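
For example, once at the prompt on the GPU node you might do something like the following (the module name cuda and the source file name are placeholders; check "module avail" for the actual CUDA module name):

module load cuda                  # placeholder; actual module name may differ
nvcc -o my_prog my_prog.cu        # compile CUDA source on the GPU node
./my_prog                         # run on the allocated GPU
exit                              # exiting the shell ends the interactive job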