Batch system¶
As described in the Hardware Overview chapter, users only have direct access to the four login nodes of HoreKa. Access to the compute nodes is only possible through the so-called batch system. The batch system on HoreKa is Slurm.
Slurm is an open source, fault-tolerant, and highly scalable job scheduling system for large and small Linux clusters. Slurm fulfills three key functions:
- It allocates exclusive and/or non-exclusive access to resources - the compute nodes - to users for some duration of time so they can perform work.
- It provides a framework for starting, executing, and monitoring work on the set of allocated nodes.
- It arbitrates contention for resources by managing a queue of pending work.
Any kind of calculation on the compute nodes of HoreKa requires the users to define a sequence of commands to be executed and a specification of the required run time, number of CPU cores and main memory etc. Such a combination of a list of commands and some metadata is called a batch job. Batch jobs have to be submitted to Slurm and are executed in an asynchronous manner as soon as the system decides to run them.
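As a minimal sketch (the resource values and the program call are placeholders), such a batch job could look like this:

```bash
#!/bin/bash
#SBATCH --time=00:30:00       # requested wall clock time
#SBATCH --ntasks=1            # number of tasks to launch
#SBATCH --cpus-per-task=4     # CPU cores for the task

# The sequence of commands that makes up the actual work
./my_program                  # placeholder program
```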
HAICORE batch system partitions¶
Slurm manages job queues for different "partitions". Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits. On HAICORE@KIT two different job queues are available:
- normal: This queue can be used by everyone with access to the cluster. Here up to 1 node per job with a maximum runtime of 3 days can be used.
- advanced: For these queues a special privilege is needed. All nodes can be used, including the 3 DGX A100 nodes, with a maximum job time of 3 days. In order to get access to the advanced queues, new users have to open a ticket stating name, nationality, and the field of research.
Queue | Node type(s) | Access policy | Minimum resources | Default resources | Maximum resources |
---|---|---|---|---|---|
normal | GPU4 | Shared | nodes=1, ntasks=1 | time=00:10:00, ntasks=1, gres=gpu:full:1, mem-per-gpu=125400, cpu-per-gpu=38 | time=72:00:00, nodes=1, ntasks=152, gres=gpu:full:4, mem=501600mb |
advanced | GPU4 | Shared | nodes=1, ntasks=1 | time=00:10:00, ntasks=1, mem-per-gpu=125000mb | time=72:00:00, nodes=10, ntasks=152, gres=gpu:full:4, mem=1000000mb |
advanced-gpu8 | DGX A100 | Shared | nodes=1, ntasks=1 | time=00:10:00, ntasks=1, mem-per-gpu=125000mb | time=72:00:00, nodes=3, ntasks=256, gres=gpu:full:8, mem=1000000mb |
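The queue is selected at submit time via the partition option, for example (the script name is a placeholder):

```bash
# Submit a placeholder job script to the normal queue
sbatch --partition=normal myjob.sh

# Submitting to the advanced queues requires the special privilege described above
sbatch --partition=advanced myjob.sh
```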
Resource Restrictions¶
HAICORE@KIT is a cluster intended for ad-hoc usage and initial experimentation with data and/or AI methods. In order to guarantee fair usage of resources, certain restrictions are set on a per-user basis: there is a limit of 5000 GPUh per year, and the number of jobs that can run at the same time is limited to 3. Once this limit is exhausted, no further resources can be used on HAICORE@KIT. The resources are reset at the beginning of each year. If more resources are needed, consider a project proposal at HAICORE@FZJ. For more information see here.
Slurm Usage¶
The official Slurm documentation is quite exhaustive, so this documentation only focuses on the most important commands and use cases.
Slurm commands | Brief explanation |
---|---|
sbatch | Submits a job and queues it |
salloc | Submits an interactive job and blocks until it completes |
scontrol show job | Displays detailed job state information |
squeue | Displays information about active, eligible, blocked, and/or recently completed jobs |
squeue --start | Returns start time of submitted job or requested resources |
gpu_avail | Shows how many GPUs are currently idle in the normal queue (if any) |
scancel | Cancels a job |
You can also access the documentation as manpages on the cluster, e.g. man sbatch.
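A few typical invocations of these commands are sketched below; the job ID 123456 is only a placeholder.

```bash
# List your own pending and running jobs
squeue -u $USER

# Show detailed information for a specific job (placeholder job ID)
scontrol show job 123456

# Estimate the expected start time of a pending job
squeue --start -j 123456

# Cancel a job that is no longer needed
scancel 123456
```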
Job Submission with sbatch/salloc¶
Batch jobs are submitted by using the command sbatch. The main purpose of the sbatch command is to specify the resources that are needed to run the job. sbatch will then queue the batch job. However, the start of the batch job depends on the availability of the requested resources.
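For example, submitting a job script (here a hypothetical myjob.sh) prints the ID under which the job is queued; the job ID shown is just an example:

```bash
$ sbatch myjob.sh
Submitted batch job 123456
```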
sbatch Command Parameters¶
sbatch options can be specified as command line parameters or by adding #SBATCH pragmas to your job script.
Command | Script | Purpose |
---|---|---|
-t "time" or --time="time" | #SBATCH --time="time" | Wall clock time limit. |
-N "count" or --nodes="count" | #SBATCH --nodes="count" | Number of nodes to be used. |
-n "count" or --ntasks="count" | #SBATCH --ntasks="count" | Number of tasks to be launched. |
--ntasks-per-node="count" | #SBATCH --ntasks-per-node="count" | Maximum count (< 77) of tasks per node. |
-c "count" or --cpus-per-task="count" | #SBATCH --cpus-per-task="count" | Number of CPUs required per (MPI-)task. |
--mem="value_in_MB" | #SBATCH --mem="value_in_MB" | Memory in megabytes per node. (You should omit setting this option.) |
--mem-per-cpu="value_in_MB" | #SBATCH --mem-per-cpu="value_in_MB" | Minimum memory required per allocated CPU. (You should omit setting this option.) |
--mail-type="type" | #SBATCH --mail-type="type" | Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL. |
--mail-user="mail-address" | #SBATCH --mail-user="mail-address" | The specified mail-address receives email notifications of state changes as defined by --mail-type. |
--output="name" | #SBATCH --output="name" | File in which job output is stored. |
--error="name" | #SBATCH --error="name" | File in which job error messages are stored. |
-J "name" or --job-name="name" | #SBATCH --job-name="name" | Job name. |
--export=[ALL,] "env-variables" | #SBATCH --export=[ALL,] "env-variables" | Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If variables should be added to the submission environment instead of replacing it, the argument ALL must be included. |
-A "group-name" or --account="group-name" | #SBATCH --account="group-name" | Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The command "scontrol show job" shows the project group the job is accounted on behind "Account=". |
-p "queue-name" or --partition="queue-name" | #SBATCH --partition="queue-name" | Request a specific queue for the resource allocation. |
-C "LSDF" or --constraint="LSDF" | #SBATCH --constraint=LSDF | Job constraint to use the LSDF file system. |
-C "BEEOND" or --constraint="BEEOND" | #SBATCH --constraint=BEEOND | Job constraint to request the BeeOND file system. |
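Putting several of these options together, a job script might look like the following minimal sketch; the resource values, file names, and the program call are placeholders to be adapted to the actual workload.

```bash
#!/bin/bash
#SBATCH --job-name=example-job          # placeholder job name
#SBATCH --partition=normal              # queue to submit to
#SBATCH --nodes=1                       # the normal queue allows at most 1 node
#SBATCH --ntasks=1                      # number of tasks (processes)
#SBATCH --cpus-per-task=8               # CPU cores for the single task
#SBATCH --time=01:00:00                 # wall clock limit (hh:mm:ss)
#SBATCH --output=job_%j.out             # stdout file, %j expands to the job ID
#SBATCH --error=job_%j.err              # stderr file

# Placeholder workload: replace with the actual application call
srun ./my_application
```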
Examples¶
Requesting full GPUs¶
To request a GPU resource the option --gres=gpu:full:AMOUNT has to be specified. In the normal queue the number of GPUs requested at once is limited to 4.
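For illustration, a job script requesting one full GPU in the normal queue might look like the following sketch (the resource values and the application call are placeholders):

```bash
#!/bin/bash
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:full:1               # request one full GPU (up to 4 in the normal queue)
#SBATCH --time=02:00:00

# Placeholder: replace with the actual GPU application
srun python train.py
```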
Multi-Instance GPU (MIG)¶
In order to supply more people with compute resources, NVIDIA's Multi-Instance GPU (MIG) is activated on 2 nodes. It allows for the secure partitioning of a GPU into up to seven separate GPU instances for CUDA applications. With MIG, users can see and schedule jobs on these virtual GPU instances as if they were physical GPUs.
The profiles are named as follows: AMOUNT_COMPUTE_SLICES.GPU_MEMORY.
Currently there are 3 different GPU profiles available:
Name | Amount | Slurm Option |
---|---|---|
1g.5gb | 16 | --gres=gpu:1g.5gb:1 |
2g.10gb | 4 | --gres=gpu:2g.10gb:1 |
4g.20gb | 8 | --gres=gpu:4g.20gb:1 |
Slurm Option
The Slurm option at submit time is --gres=gpu:PROFILE:AMOUNT_OF_INSTANCES.
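For example, an interactive session on a single 1g.5gb MIG instance could be requested like this (the time limit is a placeholder):

```bash
# Request one 1g.5gb MIG instance for an interactive session (placeholder time limit)
salloc --partition=normal --ntasks=1 --gres=gpu:1g.5gb:1 --time=00:30:00
```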
With the command nvidia-smi the allocated MIG devices can be viewed:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:31:00.0 Off | On |
| N/A 43C P0 71W / 400W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Free GPUs¶
To see how many GPUs or MIG instances are currently idle in the normal queue the command gpu_avail can be used:
gpu_avail
full: 19 GPUs free
1g.5gb: 15 GPUs free
2g.10gb: 4 GPUs free
4g.20gb: 7 GPUs free
Environment variables¶
The following environment variables are available within batch jobs while they are running:
Environment variable | Brief explanation |
---|---|
SLURM_JOB_CPUS_PER_NODE | Number of CPUs per node dedicated to the job |
SLURM_JOB_NODELIST | List of nodes dedicated to the job |
SLURM_JOB_NUM_NODES | Number of nodes dedicated to the job |
SLURM_MEM_PER_NODE | Memory per node dedicated to the job |
SLURM_NPROCS | Total number of processes dedicated to the job |
SLURM_CLUSTER_NAME | Name of the cluster executing the job |
SLURM_CPUS_PER_TASK | Number of CPUs requested per task |
SLURM_JOB_ACCOUNT | Account name |
SLURM_JOB_ID | Job ID |
SLURM_JOB_NAME | Job Name |
SLURM_JOB_PARTITION | Partition/queue running the job |
SLURM_JOB_UID | User ID of the job's owner |
SLURM_SUBMIT_DIR | Job submit folder (the directory from which sbatch was invoked) |
SLURM_JOB_USER | User name of the job's owner |
SLURM_RESTART_COUNT | Number of times job has restarted |
SLURM_PROCID | Task ID (MPI rank) |
SLURM_NTASKS | The total number of tasks available for the job |
SLURM_STEP_ID | Job step ID |
SLURM_STEP_NUM_TASKS | Task count (number of MPI ranks) |
SLURM_JOB_CONSTRAINT | Job constraints |
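These variables can be used directly inside a job script, for example to log the allocated resources or to size thread pools. The sketch below only echoes a few of them; the resource values and the application call are placeholders.

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00

# Log where and with which resources the job runs
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) in partition ${SLURM_JOB_PARTITION}"
echo "Nodes: ${SLURM_JOB_NODELIST}, tasks: ${SLURM_NTASKS}"

# A common pattern: match the OpenMP thread count to the allocated CPUs
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_application   # placeholder application
```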
Energy measurement¶
Slurm allows for the measurement of consumed energy on a job basis. This is automatically done for every job started on HoreKa.
The energy consumed can be checked after a job has finished either with scontrol show job (in the energy field) or with sacct using the --format option and the field ConsumedEnergy, e.g.:
$ sacct --format User,Account,JobID,JobName,ConsumedEnergy,NodeList,Elapsed,State
Furthermore, a job feedback is appended to every job output file. There, both the consumed energy in joules / watt-hours and the average node power draw are displayed.
To display the energy consumed during a running job sstat can be used e.g.:
$ sstat <JOB_ID> --format JobID,ConsumedEnergy,NodeList
Energy Measurement
The energy is measured on the node level, meaning the measurements only reflect the job's real consumption in the case of an exclusive node allocation. The values show the amount of energy consumed by all involved nodes, but not by the interconnect or the file systems.