Job Submission Tutorial

Introduction:

When an account is created, it includes a directory called pi_MPI that contains a sample program and a job submission script. This tutorial covers how to compile the program and submit it as a job to be run on the cluster.

The pi_MPI directory:

If you have a new account, you will have a pi_MPI directory with the following contents. The "user" for this tutorial is "abol":

[abol@katahdin ~]$ ls -l pi_MPI

total 5

-rw-rw-r-- 1 abol abol 1146 Sep 9 14:53 go.slurm

-rw-r--r-- 1 abol abol 260 Aug 1 2012 Makefile

-rw-r--r-- 1 abol abol 1143 Sep 9 2004 pi_MPI.f

-rw-r--r-- 1 abol abol 780 Sep 9 14:51 README

The pi_MPI.f file is the parallel Fortran source code for calculating Pi. To create a runnable program, this code needs to be compiled with a Fortran compiler and then linked with the MPI libraries into an executable binary. The Makefile is used to handle both of these steps.
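
For reference, the compile step the Makefile runs is a single call to an MPI compiler wrapper, so compiling the code by hand would look along these lines (the wrapper name can differ between MPI installations):

[abol@katahdin pi_MPI]$ mpif77 -o pi_MPI pi_MPI.f

The mpif77 wrapper calls the loaded Fortran compiler and adds the MPI include and library flags for you, which is why no extra flags are needed here.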

In order to compile the source code and create the binary file, both a Fortran compiler and an MPI library need to be in the shell environment. By default, this is already set up using a GNU compiler and a build of MVAPICH2 (an MPI implementation) that was built with the same version of the GNU compiler. Here is how to check that they are loaded and then compile and link the code:

[abol@katahdin ~]$ module list


Currently Loaded Modules:

1) autotools 2) prun/1.3 3) gnu8/8.3.0 4) mvapich2/2.3.2 5) ohpc


[abol@katahdin ~]$ cd pi_MPI/


[abol@katahdin pi_MPI]$ make

compiling with MPI version that is loaded

make -k any

make[1]: Entering directory `/home/abol/pi_MPI'

mpif77 -o pi_MPI pi_MPI.f

make[1]: Leaving directory `/home/abol/pi_MPI'


[abol@katahdin pi_MPI]$ ls -l

total 28

-rw-rw-r-- 1 abol abol 1146 Sep 9 14:53 go.slurm

-rw-r--r-- 1 abol abol 260 Aug 1 2012 Makefile

-rwxrwxr-x 1 abol abol 23792 Sep 9 15:40 pi_MPI

-rw-r--r-- 1 abol abol 1143 Sep 9 2004 pi_MPI.f

-rw-r--r-- 1 abol abol 780 Sep 9 14:51 README

This shows that both the "gnu8/8.3.0" compiler suite and the "mvapich2/2.3.2" version of MVAPICH2 are loaded into the environment and available for use. Running the "make" command compiled the pi_MPI.f code and linked it with the MPI libraries to create a new executable binary called "pi_MPI".
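
If "module list" on your account does not show a compiler or an MPI library (for example, because you have changed your modules at some point), you should be able to load them again with "module load", using the module names from the output above (names may differ on other systems):

[abol@katahdin ~]$ module load gnu8/8.3.0 mvapich2/2.3.2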

In order to run this code in parallel on the cluster, it needs to be submitted to the SLURM scheduler. One way to do that is with a SLURM script: a regular shell script containing directives that tell SLURM what resources the job needs. The file "go.slurm" is the job script for this program:

#!/bin/bash

#SBATCH --job-name=parallel_pi_test # Job name

#SBATCH --ntasks-per-node=4 # Run 4 tasks per node

#SBATCH --nodes=2 # Run on 2 nodes

#SBATCH --mincpus=4 # Ensure at least 4 cores per node

#SBATCH --partition=haswell # Select partition

#SBATCH --mem=1gb # Job memory request

#SBATCH --time=00:05:00 # Time limit hrs:min:sec

#SBATCH --output=parallel_pi_%j.log # Standard output and error log

# If you want email notification uncomment below and edit email address

# #SBATCH --mail-type=begin # send email when job begins

# #SBATCH --mail-type=end # send email when job ends

# #SBATCH --mail-type=fail # send email if job fails

# #SBATCH --mail-user=email.address@domain.com


echo "====================================================="

pwd; hostname; date

echo "====================================================="


module load mvapich2

echo "Running pi_MPI"


srun ./pi_MPI

echo "====================================================="

date


echo " "

echo -n " Sleeping 240 seconds... "

sleep 240

echo "... done"

We won't go over the #SBATCH directive details in this tutorial, but they can be found in the SLURM area of this web site. The main points of the script, summarized here, are:

  1. The job requires 8 CPU cores in total (4 cores on each of 2 nodes).

  2. The partition that the job is to run in is called "haswell" (a command to list the available partitions is shown after this list).

  3. Output from the job will be sent to a file named "parallel_pi_%j.log", where the %j part of the file name will be replaced by the Job ID that SLURM assigns.
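
To see which partitions exist on the cluster (and confirm that "haswell" is one of them), you can use SLURM's "sinfo" command, which lists the partitions and the state of their nodes; it is a general SLURM command and not something this tutorial requires:

[abol@katahdin pi_MPI]$ sinfo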

In general, the script then just goes through the steps we would normally run in a terminal to run the program. A key point, though, is that once the job starts running on the two nodes, SLURM automatically sets the working directory to the pi_MPI directory, since that is where we will submit the job from. So we do not need to put a command like "cd ~/pi_MPI" in the script.
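
If you ever do want to be explicit about the working directory (for example, if you submit the job from somewhere else), SLURM sets the environment variable SLURM_SUBMIT_DIR to the directory the job was submitted from, so a line like the following near the top of the script would do it; this is optional and is not part of go.slurm:

cd "$SLURM_SUBMIT_DIR"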

Ordinarily, MPI programs are launched with the "mpirun" command. With SLURM, however, the "srun" command is used instead. So, the actual command the script uses to calculate Pi on 8 CPU cores is:

srun ./pi_MPI

We don't need to tell it how many processes to use, since SLURM passes that information to the program automatically based on the CPU resources the script requested.
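
For comparison, running the same program outside of SLURM with a plain MPI launcher would look something like the line below, where the process count has to be given by hand (exact mpirun options vary between MPI implementations, so treat this only as an illustration):

mpirun -np 8 ./pi_MPI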

Since this program is so fast (it takes just a few seconds, and most of that time is spent setting the job up to run on the nodes), there is a section of the script after the "srun" command that simply sleeps for 240 seconds. This gives us time to see the job in the queue.

To submit the job to SLURM, use the command "sbatch go.slurm":

[abol@katahdin pi_MPI]$ sbatch go.slurm

Submitted batch job 961815

The output of this command shows that SLURM has given this job the ID 961815. To see that it is running, use the "squeue -u abol" command:

[abol@katahdin pi_MPI]$ squeue -u abol

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

961815 haswell parallel abol R 1:07 2 node-[124-125]

From this output we see that the job is running (ST = "R") and has been running for 1 minute and 7 seconds on two nodes: node-124 and node-125. Based on the "--output" directive in the script, along with the ID that SLURM gave the job, the output file name should be "parallel_pi_961815.log". To see the contents of this file, run the "cat" command:

[abol@katahdin pi_MPI]$ cat parallel_pi_961815.log

=====================================================

/home/abol/pi_MPI

node-124.cluster

Fri Sep 9 16:08:55 EDT 2022

=====================================================

Running pi_MPI

pi is 3.1415926535867702 Error is 3.0229152514493762E-012

time is 3.5623960494995117 seconds

=====================================================

Fri Sep 9 16:09:00 EDT 2022

Sleeping 240 seconds...

After 240 seconds, the job will stop sleeping and end. If we decide we want to end the job sooner, we can use the "scancel" command with the Job ID:

[abol@katahdin pi_MPI]$ scancel 961815
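
If you have several jobs in the queue and want to cancel all of them at once, "scancel" also accepts a user name; be careful with this, since it cancels every job you own:

[abol@katahdin pi_MPI]$ scancel -u abol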