7. Running Jobs on Expanse

In this Section:

Parallel Models
Methods for Running Jobs on Expanse
Slurm Resource Manager
Slurm Partitions
Slurm Job State Codes
Common Slurm Commands
Slurm Batch Script Example

Parallel Models: Distributed Memory

Distributed Memory architecture

  • Programs run asynchronously and pass messages for communication and coordination between resources.
  • Examples include: SOA-based systems, massively multiplayer online games, peer-to-peer apps.
  • Different types of implementations for the message passing mechanism: HTTP, RPC-like connectors, message queues
  • HPC historically uses the Message Passing Interface (MPI); see the example batch script after this list
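For example, on Expanse an MPI program is normally launched through the batch scheduler rather than directly. The script below is a minimal, illustrative sketch only: the executable name hello_mpi, the module names, and the account abc123 are placeholders, not part of the original text.

#!/bin/bash
#SBATCH --job-name="hello-mpi"
#SBATCH --output="hello-mpi.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --account=abc123      # placeholder account
#SBATCH -t 00:05:00

# Load a compiler and MPI stack (illustrative names; check `module avail` on Expanse)
module purge
module load cpu gcc openmpi

# srun starts one MPI rank per requested task (2 x 128 = 256 ranks here)
srun ./hello_mpi              # placeholder MPI executable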

Parallel Models: Shared Memory

Shared Memory architecture

  • CPUs all share the same local memory (SHMEM)
    • Coordination and communication between tasks via interprocessor communication (IPC) or virtual memory mappings.
  • May use: uniform or non-uniform memory access (UMA or NUMA); cache-only memory architecture (COMA).
  • Most common HPC APIs for using SHMEM:
    • Portable Operating System Interface (POSIX); Open Multi-Processing (OpenMP), designed for parallel computing and best suited to multi-core computing (see the example batch script after this list).
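As a shared-memory counterpart to the MPI sketch above, the script below shows how an OpenMP program might be run on a single node through Slurm. The executable hello_openmp, the module names, and the account are again placeholders.

#!/bin/bash
#SBATCH --job-name="hello-openmp"
#SBATCH --output="hello-openmp.%j.%N.out"
#SBATCH --partition=shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --account=abc123      # placeholder account
#SBATCH -t 00:05:00

module purge
module load cpu gcc

# All OpenMP threads share one node's memory; match the thread count to the cores requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./hello_openmp                # placeholder OpenMP executable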

Methods for Running Jobs on Expanse

Batch Jobs

  • Batch Jobs are used to submit batch scripts to Slurm from the login nodes. You need to specify (see the sketch after this list):
    • Partition (queue)
    • Time limit for the run (maximum of 48 hours)
    • Number of nodes, tasks per node; Memory requirements (if any)
    • Job name, output file location; Email info, configuration
  • When you run in batch mode, you submit jobs to be run on the compute nodes using the sbatch command.
  • Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes.
  • Expanse places limits on the number of jobs queued and running on a per group (allocation) and partition basis.
  • Please note that submitting a large number of jobs (especially very short ones) can impact the overall scheduler response for all users.
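The sketch below pulls together the directives listed above (partition, time limit, nodes and tasks per node, memory, job name and output file, and email). The account, email address, and memory value are placeholder assumptions; a complete, working example appears in the Slurm Batch Script Example later in this section.

#!/bin/bash
#SBATCH --job-name="myjob"             # job name
#SBATCH --output="myjob.%j.%N.out"     # output file location
#SBATCH --partition=compute            # partition (queue)
#SBATCH --nodes=1                      # number of nodes
#SBATCH --ntasks-per-node=128          # tasks per node
#SBATCH --mem=32G                      # memory requirement, if any (illustrative value)
#SBATCH --account=abc123               # placeholder allocation/account
#SBATCH --mail-type=BEGIN,END,FAIL     # email configuration
#SBATCH --mail-user=user@example.edu   # placeholder email address
#SBATCH -t 01:00:00                    # time limit (48 hours maximum)

echo "Hello from job $SLURM_JOB_ID on $SLURM_NODELIST"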

Interactive Jobs

  • Interactive Jobs: use the srun command to obtain a node for 'real-time, live' command-line access to a compute node.

CPU:

srun --partition=debug --qos=debug-normal --pty --account=abc123 --nodes=1 --ntasks-per-node=128 --mem=248G -t 00:30:00 --wait=0 --export=ALL /bin/bash
    

GPU:

srun --pty --account=abc123 --nodes=1 --ntasks-per-node=1 --cpus-per-task=10 -p gpu-debug --gpus=1 -t 00:10:00 /bin/bash
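Once srun returns a prompt on the compute node, you work interactively just as you would from a batch script, and exit releases the allocation. The sequence below is illustrative only; the hostname, module names, and executable are placeholders.

[user@exp-9-55 ~]$ module purge
[user@exp-9-55 ~]$ module load slurm cpu gcc
[user@exp-9-55 ~]$ ./my_program       # placeholder executable
[user@exp-9-55 ~]$ exit               # return the node to the scheduler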

Command Line Jobs

The login nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. Do not run parallel or large jobs on the login nodes, even for simple tests. Even if a simple test runs at the command line on a login node, full tests should not be run there, because performance will be adversely affected by the tasks and login activity of the other users sharing the same node. For example, at the moment this note was written, a single gzip process was consuming 98% of the CPU time on the Comet HPC system:

[user@comet-ln3 OPENMP]$ top
...
  PID USER  PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
19937 XXXXX 20  0 4304 680 300 R 98.2  0.0 0:19.45 gzip

Commands that you type into the terminal and run on the system are considered jobs, and they consume resources. Computationally intensive jobs should be run only on the compute nodes, never on the login nodes.

Slurm Resource Manager


Expanse uses the Simple Linux Utility for Resource Management (Slurm) resource manager. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters (https://slurm.schedmd.com/documentation.html). See also the Slurm Quick Start User Guide (https://slurm.schedmd.com/quickstart.html).

[Diagram: Slurm scheduler architecture, showing login nodes connecting to the Expanse cluster]
The user logs on to Expanse and submits a batch script to the Slurm controller daemon. Slurm parses the batch script for correct syntax and then queues the job until the requested resources are available.
  • Slurm is the "glue" that lets the parallel computer schedule and execute jobs (see the scontrol example after this list)
    • Role: Allocate resources within a cluster
    • Nodes (unique IP address)
    • Interconnect/switches
    • Generic resources (e.g. GPUs)
    • Launch and otherwise manage jobs
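To see exactly which resources Slurm is tracking for a particular node, you can query the scheduler directly; the node name below is just an example taken from the output later in this section.

# Lists CPUTot, RealMemory, Gres (e.g. GPUs), State, and other per-node attributes
$ scontrol show node exp-6-56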

Slurm Partitions

About Partitions

Partition Name | Max Walltime | Max Nodes/Job | Max Running Jobs | Max Running + Queued Jobs | Charge Factor | Comments
compute        | 48 hrs       | 32            | 32               | 64                        | 1             | Used for exclusive access to regular compute nodes
shared         | 48 hrs       | 1             | 4096             | 4096                      | 1             | Single-node jobs using fewer than 128 cores
gpu            | 48 hrs       | 4             | 4                | 8 (32 TRES GPU)           | 1             | Used for exclusive access to the GPU nodes
gpu-shared     | 48 hrs       | 1             | 24               | 24 (24 TRES GPU)          | 1             | Single-node jobs using fewer than 4 GPUs
large-shared   | 48 hrs       | 1             | 1                | 4                         | 1             | Single-node jobs using large memory, up to 2 TB (minimum memory required 256 GB)
debug          | 30 min       | 2             | 1                | 2                         | 1             | Priority access to compute nodes set aside for testing of jobs with short walltime and limited resources
gpu-debug      | 30 min       | 2             | 1                | 2                         | 1             | Priority access to GPU nodes set aside for testing of jobs with short walltime and limited resources
preempt        | 7 days       | 32            |                  | 128                       | 0.8           | Discounted jobs run on free nodes that can be preempted by jobs submitted to any other queue (NO REFUNDS)
gpu-preempt    | 7 days       | 1             |                  | 24 (24 TRES GPU)          | 0.8           | Discounted jobs run on unallocated nodes that can be preempted by jobs submitted to higher-priority queues (NO REFUNDS)
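The partition is selected with the -p/--partition flag in sbatch or srun, together with resource counts that stay within the limits above. The two submissions below are illustrative sketches (the script names and the account abc123 are placeholders):

# Single-node job in 'shared' using 16 of the 128 cores
$ sbatch -p shared --nodes=1 --ntasks-per-node=16 --account=abc123 -t 01:00:00 myjob.sb

# Single-GPU job in 'gpu-shared' (fewer than 4 GPUs per job)
$ sbatch -p gpu-shared --nodes=1 --ntasks-per-node=1 --gpus=1 --account=abc123 -t 01:00:00 mygpu.sb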

Slurm Job State Codes

Slurm has a large number of job states. The table below lists some of the most common states you will see. For full details, see the JOB STATE CODES section of the Slurm squeue documentation page.

STATE LABEL DESCRIPTION
CA CANCELLED Job was explicitly cancelled by the user or system administrator.
CD COMPLETED Job is complete/clearing.
F FAILED Job terminated with non-zero exit code or other failure condition.
PD PENDING Job is awaiting resource allocation.
R RUNNING Job currently has an allocation.
ST STOPPED Job has an allocation, but execution has been stopped with the SIGSTOP signal. CPUs have been retained by this job.
TO TIMEOUT Job terminated upon reaching its time limit.
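These codes appear in the ST column of squeue output; the -t/--states option filters on them, for example:

# Show only your pending and running jobs
$ squeue -u $USER --states=PD,R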

Common Slurm Commands

Here are a few key Slurm commands. For more information, run the man command for each (for example, man sbatch), or see the Slurm documentation page:

COMMAND DESCRIPTION
scancel Used to signal or cancel jobs, job arrays or job steps.
sbatch Submit a batch script to Slurm.
sinfo View information about Slurm nodes and partitions.
squeue Used to view job and job step information for jobs managed by Slurm.
srun Run a parallel job on a cluster managed by Slurm.
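For example, sinfo gives a quick view of partition and node availability before you submit (the exact columns depend on the site's Slurm configuration):

# Summarize node states in the compute and gpu partitions
$ sinfo -p compute,gpu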
  • Submit jobs using the sbatch command:
$ sbatch mycode-slurm.sb
Submitted batch job 8718049
  • Check the job status using the squeue command:
$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8718049   compute   mycode     user PD       0:00      1 (Priority)
  • Once the job is running:
$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8718049     debug   mycode     user  R       0:02      1 expanse-14-01
  • Cancel a running job:
$ scancel 8718049

Slurm Batch Script Example

Below is an example of a batch script that prints out your environment on the compute node:

[user@login01 ENV_INFO]$ cat env-slurm.sb 
#!/bin/bash
#SBATCH --job-name="envinfo"
#SBATCH --output="envinfo.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --export=ALL
#SBATCH --account=abc123
#SBATCH -t 00:01:00

#  Environment
module purge
module load slurm
module load cpu
#   perform some basic unix commands
echo "----------------------------------"
echo "hostname= " `hostname` 
echo "date= " `date` 
echo "whoami= " `whoami` 
echo "pwd= " `pwd` 
echo "module list= " `module list` 
echo "----------------------------------"
echo "env= " `env` 
echo "----------------------------------"
  • Portion of the output generated by this script:
[user@login01 ENV_INFO]$ cat envinfo.108867.exp-6-56.out 
----------------------------------
hostname=  exp-6-56
date=  Wed Oct 7 23:45:43 PDT 2020
whoami=  user
pwd=  /home/user/DEMO/ENV_INFO
Currently Loaded Modules:
  1) slurm/expanse/20.02.3   2) cpu/1.0
----------------------------------
env=  SLURM_MEM_PER_CPU=1024 LD_LIBRARY_PATH=/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64 LS_COLORS=rs=0

[SNIP]

 MODULESHOME=/usr/share/lmod/lmod LMOD_SETTARG_FULL_SUPPORT=no HISTSIZE=5000 LMOD_PKG=/usr/share/lmod/lmod LMOD_CMD=/usr/share/lmod/lmod/libexec/lmod SLURM_LOCALID=0 LESSOPEN=||/usr/bin/lesspipe.sh %s LMOD_FULL_SETTARG_SUPPORT=no LMOD_DIR=/usr/share/lmod/lmod/libexec BASH_FUNC_module%%=() { eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh) } BASH_FUNC_ml%%=() { eval $($LMOD_DIR/ml_cmd "$@") } _=/usr/bin/env
----------------------------------