SLURM Basics: Running Jobs
Introduction:
The Katahdin cluster uses the SLURM Resource Manager and Scheduler. Below is some basic information. More to come!
Slurm Basics:
One way to submit jobs is to create a SLURM script that you submit to SLURM with the "sbatch" command. Here is a sample job script:
#!/bin/bash
#SBATCH --job-name=my_job_name # Job name
#SBATCH --partition=haswell # Partition/Queue name
#SBATCH --mail-type=END,FAIL # Mail events
#SBATCH --mail-user=email@maine.edu # Where to send mail
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Run with 4 threads
#SBATCH --mem=60gb # Job memory request
#SBATCH --time=24:00:00 # Time limit hrs:min:sec
#SBATCH --output=test_%j.log # Standard output and error log
module load module_name ... # Load the software modules your program needs
srun program param1 ... # Launch the program under SLURM
The ntasks and cpus-per-task values change depending on whether you have a multi-threaded program (one program that uses multiple threads) or a multi-process program (multiple processes, as with MPI). In the MPI case, you would set --ntasks=50 to run with 50 processes. Other directives control how the job is laid out across nodes:
#SBATCH --ntasks-per-node=4 # Run 4 tasks on each node
#SBATCH --nodes=2 # Run on 2 nodes
This combination would be appropriate, for example, for an MPI job running with 8 processes (4 on each of 2 nodes).
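Putting it together, a minimal MPI-style batch script using these directives might look like the following sketch. The job name, module name, and program name (mpi_program) are placeholders to be replaced with your own:
#!/bin/bash
#SBATCH --job-name=mpi_test # Job name (placeholder)
#SBATCH --partition=haswell # Partition/Queue name
#SBATCH --nodes=2 # Run on 2 nodes
#SBATCH --ntasks-per-node=4 # 4 MPI processes per node (8 total)
#SBATCH --cpus-per-task=1 # One core per MPI process
#SBATCH --mem=8gb # Memory request per node
#SBATCH --time=01:00:00 # Time limit hrs:min:sec
#SBATCH --output=mpi_test_%j.log # Standard output and error log
module load module_name ... # Load your compiler/MPI modules here
srun mpi_program # srun launches all 8 MPI processes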
Useful SLURM Commands
sbatch: Command to submit a job:
sbatch script-name
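For example, if the sample script were saved as go.slurm (the filename is up to you), submission looks like this, with SLURM reporting the new job's ID (962658 is simply the ID used in the examples further down):
sbatch go.slurm
Submitted batch job 962658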
Note that the email directives (--mail-type and --mail-user) in the sample script are optional.
squeue: Command to check all jobs in the queue:
squeue
or to check just the jobs belonging to a particular user:
squeue -u user-name
Another form of this command is just "sq", which does the same thing in a slightly different format. The main additional information it provides is the total number of cores for each job, in the second-to-last column.
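A few other standard squeue options can narrow the listing; for example:
squeue -j JOB_ID # Show only the specified job
squeue -p haswell # Show only jobs in the haswell partition
squeue --start # Show estimated start times for pending jobs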
sinfo: Command to get the status of all of the Slurm partitions:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up infinite 1 mix node-153
haswell* up infinite 4 down* node-[127-129,139]
haswell* up infinite 15 mix node-[55-58,61,81,90,122-123,125-126,130,140-142]
haswell* up infinite 67 idle node-[59,63-80,82-89,91-121,124,131-138]
haswell-test up infinite 1 idle node-62
skylake up infinite 1 drain* node-148
skylake up infinite 4 mix node-[143-144,149-150]
skylake up infinite 3 idle node-[145-147]
dgx up infinite 1 mix dgx
grtx up infinite 1 idle grtx-1
epyc up infinite 4 mix node-[151,153,155,158,163]
epyc up infinite 1 alloc node-152
epyc up infinite 7 idle node-[154,156-157,159-162,164]
epyc-hm up infinite 2 alloc node-[167-168]
epyc-hm up infinite 2 idle node-[169-170]
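sinfo also accepts standard options to narrow or expand this view; for example:
sinfo -p epyc # Show only the epyc partition
sinfo -N -l # Long, node-oriented listing (CPUs, memory, and state for each node)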
scancel: Delete a job:
scancel JOB_ID
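scancel also accepts standard SLURM selectors if you want to cancel more than one job at a time; for example:
scancel -u user-name # Cancel all of your own jobs
scancel --name=my_job_name # Cancel jobs by job name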
checkjob: Checking on the status of a job:
checkjob JOB_ID
This command mimics the command of the same name from the Moab scheduler that we previously used. Sample output:
[abol@katahdin pi_MPI]$ checkjob 962658
JobId=962658 JobName=parallel_pi_test
UserId=abol(1028) GroupId=abol(1003) MCS_label=N/A
Priority=10010 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=2:0
RunTime=00:00:10 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2022-09-14T12:29:08 EligibleTime=2022-09-14T12:29:08
AccrueTime=2022-09-14T12:29:08
StartTime=2022-09-14T12:29:08 EndTime=2022-09-14T12:34:08 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2022-09-14T12:29:08
Partition=haswell AllocNode:Sid=katahdin:9998
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node-[140-141]
BatchHost=node-140
NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=2G,node=2,billing=8
Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
Nodes=node-[140-141] CPU_IDs=16-19 Mem=0 GRES_IDX=
MinCPUsNode=4 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/abol/pi_MPI/go.slurm
WorkDir=/home/abol/pi_MPI
StdErr=/home/abol/pi_MPI/parallel_pi_962658.log
StdIn=/dev/null
StdOut=/home/abol/pi_MPI/parallel_pi_962658.log
Power=
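The detail shown above is the same style of output that the standard SLURM command scontrol prints, so the following works on any SLURM cluster as well:
scontrol show job JOB_ID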
seff: Command to check the memory and CPU efficiency of a job. This command is mostly useful after a job has completed:
seff JOB_ID
for instance:
[root@katahdin slurm]# seff 962658
Job ID: 962658
Cluster: katahdin
User/Group: abol/abol
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:32:08 core-walltime
Job Wall-clock time: 00:04:01
Memory Utilized: 508.00 KB
Memory Efficiency: 0.02% of 2.00 GB
tail -f: Since you remain logged into Katahdin while your jobs run on the compute nodes, you can check on a running job by watching its output file as it is written:
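For example, with the --output=test_%j.log directive from the sample script at the top of this page and a job ID of 962658, that would be:
tail -f test_962658.log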
Partitions:
In Slurm, the term "Partition" refers to a set of nodes. Other Resource Managers refer to these as Queues. Currently, the list of partitions on the Katahdin cluster is:
debug: General debugging of code. Currently just a single node.
haswell: The largest partition in terms of nodes and cores. Around 90 nodes, each with Intel Haswell or Broadwell CPUs with either 24 or 28 cores and 64 GB or 128 GB of RAM.
skylake: 8 Intel Skylake nodes, each with 36 cores and 256 GB of RAM.
dgx: A single Nvidia DGX A100 node with two AMD Epyc2 CPUs (128 cores / 256 threads), 1 TB of RAM, and 8 Nvidia A100 GPUs, each with 40 GB of GPU RAM.
grtx: A single node with one AMD Epyc2 CPU (32 cores), 768 GB of RAM, and 8 Nvidia RTX 2080Ti GPUs, each with 11 GB of GPU RAM.
epyc: A newer partition for the AMD EPYC3 nodes. These 14 nodes each have 96 cores and 512 GB of RAM.
epyc-hm: These four nodes have AMD EPYC3 CPUs, with 32 cores and 1 TB of RAM per node.
The list of partitions, along with the current state of the nodes in each partition, can be retrieved with the "sinfo" command. For instance:
[cousins@katahdin ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 drain node-60
haswell up infinite 7 down* node-[52-54,127-129,139]
haswell up infinite 1 drain node-57
haswell up infinite 23 mix node-[56,80,82-87,94-95,112,115,126,130-138,142]
haswell up infinite 5 alloc node-[55,58-59,61,81]
haswell up infinite 53 idle node-[63-79,88-93,96-111,113-114,116-125,140-141]
haswell-test up infinite 1 drain node-62
himem up infinite 1 down* node-51
skylake up infinite 1 drain* node-148
skylake up infinite 1 mix node-149
skylake up infinite 6 idle node-[143-147,150]
dgx up infinite 1 mix dgx
grtx up infinite 1 idle grtx-1
testing up infinite 4 alloc node-[151-154]
Getting more details about a job:
The checkjob command can be used to get more information about a job:
checkjob $JOB_ID
where $JOB_ID is the ID for the job that you want information about.
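One convenient pattern is to capture the job ID at submission time with sbatch's standard --parsable option (which prints just the ID) and reuse it; go.slurm is the example script name from above:
JOB_ID=$(sbatch --parsable go.slurm)
checkjob $JOB_ID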
Interactive Jobs:
In general, you submit jobs to SLURM and the job gets sent to a node or set of nodes and runs in the background. That is, after you submit the job, you are returned to the shell prompt and can continue with what you were doing. Occasionally, you might want to interact with the job directly on the node where it is running. You can do this with the srun command, like this:
srun --partition=grtx --ntasks=1 --cpus-per-task=4 --mem=64gb --gres=gpu:1 --time=10:00:00 --pty /bin/bash
This command requests an interactive job in the "grtx" partition with 4 CPU cores, one GPU, and 64 GB of RAM for 10 hours. Once the resources are available, your terminal is connected to the node running the job and you are placed at a shell prompt there. From there, you can run commands. This is particularly helpful when you want to use the nvcc command on a GPU node to compile CUDA code.
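For instance, once the interactive shell opens on the GPU node, a CUDA compile session might look like the following sketch; the module name and source file are placeholders, so check "module avail" for the actual CUDA module on Katahdin:
module load cuda # Placeholder name; use the CUDA module shown by "module avail"
nvcc my_kernel.cu -o my_kernel # Compile a CUDA source file (placeholder name)
./my_kernel # Run it on the node's GPU
exit # Leaving the shell ends the interactive job and releases the resources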