Batch system¶
As described in the Hardware Overview chapter, users only have direct access to the four login nodes of HoreKa. Access to the compute nodes is only possible through the so-called batch system. The batch system on HoreKa is Slurm.
Slurm is an open source, fault-tolerant, and highly scalable job scheduling system for large and small Linux clusters. Slurm fulfills three key functions:
- It allocates exclusive and/or non-exclusive access to resources - the compute nodes - to users for some duration of time so they can perform work.
- It provides a framework for starting, executing, and monitoring work on the set of allocated nodes.
- It arbitrates contention for resources by managing a queue of pending work.
Any kind of calculation on the compute nodes of HoreKa requires the users to define a sequence of commands to be executed and a specification of the required run time, number of CPU cores and main memory etc. Such a combination of a list of commands and some metadata is called a batch job. Batch jobs have to be submitted to Slurm and are executed in an asynchronous manner as soon as the system decides to run them.
HoreKa batch system partitions¶
Slurm manages job queues for different "partitions". Partitions are used to group similar node types (e.g. nodes with and without accelerators) and to enforce different access policies and resource limits.
On HoreKa there are two main partitions: accelerated for the nodes with GPUs (HoreKa Green) and cpuonly for the nodes without GPUs (HoreKa Blue). Since heavy computations on the login nodes are not allowed, there are also so-called development queues (prefixed with dev_), which can be used for short interactive jobs and use cases like building a particularly big piece of software.
Queue | Node type(s) | Access policy | Minimum resources | Default resources | Maximum resources |
---|---|---|---|---|---|
dev_cpuonly | Standard and High Memory | Shared | nodes=1, ntasks=1 | time=00:10:00, ntasks=1, mem-per-cpu=1600mb | time=4:00:00, nodes=12, ntasks=152, mem=243200mb |
dev_accelerated | HoreKa Green | Shared | nodes=1, ntasks=1, gres=gpu:1 | time=00:10:00, ntasks=1, mem-per-gpu=125400, cpu-per-gpu=38 | time=1:00:00, nodes=2, ntasks=152, gres=gpu:4, mem=501600mb |
dev_accelerated-h100 | HoreKa Teal | Shared | nodes=1, ntasks=1, gres=gpu:1 | time=00:10:00, ntasks=1, mem-per-gpu=193000, cpu-per-gpu=32 | time=1:00:00, nodes=1, ntasks=128, gres=gpu:4, mem=772000mb |
cpuonly | Standard and High Memory | Exclusive | nodes=1, ntasks=152 | time=00:10:00, ntasks=152, mem=243200mb, mem-per-cpu=1600mb | time=3-00:00:00, nodes=192, ntasks=152, mem=501600mb |
accelerated | HoreKa Green | Shared | nodes=1, ntasks=1, gres=gpu:1 | time=00:10:00, ntasks=1, mem-per-gpu=125400, cpu-per-gpu=38 | time=2-00:00:00, nodes=128, ntasks=152, gres=gpu:4, mem=501600mb |
large | Extra Large Memory | Shared | nodes=1, ntasks=1 | time=00:10:00, ntasks=1, mem-per-cpu=27130mb | time=2-00:00:00, nodes=8, ntasks=152, mem=4123930mb |
accelerated-h100 | HoreKa Teal | Shared | nodes=1, ntasks=1, gres=gpu:1 | time=00:10:00, ntasks=1, mem-per-gpu=193000, cpu-per-gpu=32 | time=2-00:00:00, nodes=16, ntasks=128, gres=gpu:4, mem=772000mb |
All Intel nodes have 76 physical cores, and Simultaneous Multithreading (SMT, also called "Hyperthreading") with 2 threads per core is activated. This results in 76 * 2 = 152 possible threads per node (hence ntasks=152). Your application may profit from SMT, but using more than 76 tasks per node can also significantly degrade performance, depending on the exact instructions performed.
The AMD nodes have 64 physical cores which results in 128 possible threads.
The cpuonly queue contains both Standard and High Memory nodes. The Standard nodes have a memory limit of 243200 MB per node, and the High Memory nodes have a limit of 501600 MB. This is why the default memory per CPU is set to 1600 MB (152 * 1600 MB = 243200 MB). Slurm will automatically schedule jobs to nodes with sufficient resources, but there are only 32 High Memory nodes. Please request more than 243200 MB of memory per node in cpuonly only if you really need to.
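The relationship between cores, SMT threads, and the default memory per CPU can be checked with a few lines of shell arithmetic. This is only a sketch that uses the figures quoted above; the variable names are made up for illustration.

```shell
#!/bin/bash
# Figures taken from the partition table and the text above.
threads_per_node=$((76 * 2))   # 76 physical cores, 2 SMT threads each
mem_per_cpu_mb=1600            # default mem-per-cpu in cpuonly

# Memory available on a Standard node when every thread gets the default.
standard_node_mem_mb=$((threads_per_node * mem_per_cpu_mb))

echo "threads per node: ${threads_per_node}"
echo "memory per Standard node: ${standard_node_mem_mb} MB"
```

The script prints 152 threads and 243200 MB, which matches the Standard node limit.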
The development queues (prefixed with dev_) are for short, usually interactive jobs that are used for developing, compiling and testing code and workflows. The intention behind the development queues is to give users immediate access to compute resources without having to wait. This is the place to run short bursts of heavy computation without affecting other users, as would be the case on the login nodes.
Development queues
Do not misuse these queues for regular short-running jobs or chain jobs! Only one running job at a time is allowed.
Slurm Usage¶
The official Slurm documentation is quite exhaustive, so this documentation only focuses on the most important commands and use cases.
Slurm commands | Brief explanation |
---|---|
sbatch | Submits a job and queues it |
salloc | Submits an interactive job and blocks until it completes |
scontrol show job | Displays detailed job state information |
squeue | Displays information about active, eligible, blocked, and/or recently completed jobs |
squeue --start | Returns start time of submitted job or requested resources |
sinfo_t_idle | Shows how many nodes are currently idle (if any) |
scancel | Cancels a job |
You can also access the documentation as manpages on the cluster, e.g. man sbatch.
Job Submission with sbatch/salloc¶
Batch jobs are submitted with the command sbatch. The main purpose of the sbatch command is to specify the resources that are needed to run the job. sbatch then queues the batch job. When the batch job actually starts, however, depends on the availability of the requested resources.
sbatch Command Parameters¶
sbatch options can be specified as command line parameters or by adding special #SBATCH directives to your job script.
Command | Script | Purpose |
---|---|---|
-t "time" or --time="time" | #SBATCH --time="time" | Wall clock time limit. |
-N "count" or --nodes="count" | #SBATCH --nodes="count" | Number of nodes to be used. |
-n "count" or --ntasks="count" | #SBATCH --ntasks="count" | Number of tasks to be launched. |
--ntasks-per-node="count" | #SBATCH --ntasks-per-node="count" | Maximum count (< 77) of tasks per node. |
-c "count" or --cpus-per-task="count" | #SBATCH --cpus-per-task="count" | Number of CPUs required per (MPI) task. |
--mem="value_in_MB" | #SBATCH --mem="value_in_MB" | Memory in megabytes per node. (You should omit the setting of this option.) |
--mem-per-cpu="value_in_MB" | #SBATCH --mem-per-cpu="value_in_MB" | Minimum memory required per allocated CPU. (You should omit the setting of this option.) |
--mail-type="type" | #SBATCH --mail-type="type" | Notify user by email when certain event types occur. Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL. |
--mail-user="mail-address" | #SBATCH --mail-user="mail-address" | The specified mail address receives email notifications of state changes as defined by --mail-type. |
--output="name" | #SBATCH --output="name" | File in which job output is stored. |
--error="name" | #SBATCH --error="name" | File in which job error messages are stored. |
-J "name" or --job-name="name" | #SBATCH --job-name="name" | Job name. |
--export=[ALL,] "env-variables" | #SBATCH --export=[ALL,] "env-variables" | Identifies which environment variables from the submission environment are propagated to the launched application. Default is ALL. If the variables are meant to be added to the submission environment instead of replacing it, the argument ALL must be included. |
-A "group-name" or --account="group-name" | #SBATCH --account="group-name" | Charge resources used by this job to the specified group. You may need this option if your account is assigned to more than one group. The output of "scontrol show job" shows the project group the job is accounted on after "Account=". |
-p "queue-name" or --partition="queue-name" | #SBATCH --partition="queue-name" | Request a specific queue for the resource allocation. |
-C "LSDF" or --constraint="LSDF" | #SBATCH --constraint=LSDF | Job constraint to use the LSDF file system. |
-C "BEEOND" ("BEEOND_4MDS", "BEEOND_MAXMDS") or --constraint="BEEOND" ("BEEOND_4MDS", "BEEOND_MAXMDS") | #SBATCH --constraint=BEEOND (BEEOND_4MDS, BEEOND_MAXMDS) | Job constraint to request the BeeOND file system. |
Examples¶
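As an illustration of the options listed above, a minimal job script might look like the following sketch. The job name, the output file name, and the choice of the dev_cpuonly queue are assumptions made for this example; %j in the output name is replaced by Slurm with the job ID. The fallback values after ":-" are only there so that the script also runs outside of a job.

```shell
#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out

# The #SBATCH lines are comments to the shell; only sbatch interprets them.
# Outside of Slurm the SLURM_* variables are unset, hence the "unset" fallbacks.
summary="Job ${SLURM_JOB_ID:-unset} (${SLURM_JOB_NAME:-unset}) in partition ${SLURM_JOB_PARTITION:-unset}"

echo "$summary"
echo "Running on $(hostname)"
```

Saved under a hypothetical name such as hello.sh, the script would be submitted with sbatch hello.sh. The same resources could instead be requested on the command line, for example sbatch -p dev_cpuonly -N 1 -n 1 -t 00:10:00 hello.sh.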
Environment variables¶
The following environment variables are available within batch jobs while they are running:
Environment | Brief explanation |
---|---|
SLURM_JOB_CPUS_PER_NODE | Number of CPUs per node dedicated to the job |
SLURM_JOB_NODELIST | List of nodes dedicated to the job |
SLURM_JOB_NUM_NODES | Number of nodes dedicated to the job |
SLURM_MEM_PER_NODE | Memory per node dedicated to the job |
SLURM_NPROCS | Total number of processes dedicated to the job |
SLURM_CLUSTER_NAME | Name of the cluster executing the job |
SLURM_CPUS_PER_TASK | Number of CPUs requested per task |
SLURM_JOB_ACCOUNT | Account name |
SLURM_JOB_ID | Job ID |
SLURM_JOB_NAME | Job Name |
SLURM_JOB_PARTITION | Partition/queue running the job |
SLURM_JOB_UID | User ID of the job's owner |
SLURM_SUBMIT_DIR | Job submit folder (the directory from which sbatch was invoked) |
SLURM_JOB_USER | User name of the job's owner |
SLURM_RESTART_COUNT | Number of times job has restarted |
SLURM_PROCID | Task ID (MPI rank) |
SLURM_NTASKS | The total number of tasks available for the job |
SLURM_STEP_ID | Job step ID |
SLURM_STEP_NUM_TASKS | Task count (number of MPI ranks) |
SLURM_JOB_CONSTRAINT | Job constraints |
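A job script can read these variables to adapt to the resources it was given. The following is a minimal sketch; the function name describe_job and the fallback values are assumptions for illustration, chosen so that the function also runs outside of a job.

```shell
#!/bin/bash
# Print a one-line summary of the current allocation.
# Fallbacks after ":-" apply only when the variable is unset (e.g. outside a job).
describe_job() {
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    local tasks="${SLURM_NTASKS:-1}"
    # Integer division; assumes tasks are spread evenly across the nodes.
    local per_node=$((tasks / nodes))
    echo "job=${SLURM_JOB_ID:-none} nodes=${nodes} tasks=${tasks} tasks_per_node=${per_node}"
}

describe_job
```

Inside a job with 2 nodes and 152 tasks, for example, the function would report 76 tasks per node.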
Energy measurement¶
Slurm allows for the measurement of consumed energy on a per-job basis. This is done automatically for every job started on HoreKa.
The energy consumed can be checked after a job has finished with either scontrol show job in the energy field or with sacct using the --format option and the field ConsumedEnergy e.g.:
$ sacct --format User,Account,JobID,JobName,ConsumedEnergy,NodeList,Elapsed,State
Furthermore, a job feedback section is appended to every job output file. It displays both the energy consumed in joules and watt-hours and the average node power draw.
To display the energy consumed during a running job sstat can be used e.g.:
$ sstat <JOB_ID> --format JobID,ConsumedEnergy,NodeList
Energy Measurement
Energy is measured at the node level, so the measurements reflect a job's real consumption only if the job has exclusive access to its nodes. The values cover the energy consumed by all involved nodes, but not by the interconnect or the file systems.
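The job feedback reports energy in both joules and watt-hours together with an average power draw. How these quantities relate can be sketched with two small helper functions; the function names and the sample values are made up for illustration, and the real feedback may be computed differently.

```shell
#!/bin/bash
# Convert an energy value in joules to watt-hours (1 Wh = 3600 J).
joules_to_wh() {
    awk -v j="$1" 'BEGIN { printf "%.2f\n", j / 3600 }'
}

# Average power in watts from energy in joules and elapsed time in seconds.
avg_power_w() {
    awk -v j="$1" -v s="$2" 'BEGIN { printf "%.1f\n", j / s }'
}

# Made-up sample: 7200000 J consumed over one hour (3600 s).
joules_to_wh 7200000      # prints 2000.00
avg_power_w 7200000 3600  # prints 2000.0
```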