# Batch system
As described in the Hardware Overview chapter, users only have direct access to the two login nodes of the Future Technologies Partition (FTP). Access to the compute nodes is only possible through the batch system. The batch system on FTP is Slurm.
Slurm is an open-source, fault-tolerant, and highly scalable job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Any calculation on the compute nodes requires the user to define the work as a single command or a sequence of commands, together with the required run time, number of CPU cores, and amount of main memory, and to submit this batch job to the resource and workload manager. All job submissions are therefore performed with Slurm commands. Slurm queues and runs user jobs based on fair-share policies. A minimal batch script is sketched below.
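The following sketch shows the typical workflow. The queue name is taken from the tables below; the resource values, job name, and application binary (`./my_app`) are placeholders and must be adapted to your own job.

```bash
#!/bin/bash
#SBATCH --partition=a64fx      # target queue (see the queue tables below)
#SBATCH --nodes=1              # number of compute nodes
#SBATCH --ntasks=48            # number of (MPI) tasks
#SBATCH --time=01:00:00        # requested wall-clock time
#SBATCH --mem=28000mb          # requested main memory per node
#SBATCH --job-name=my_job      # placeholder job name

# Start the (parallel) application on the allocated resources.
srun ./my_app
```

The script is submitted with `sbatch job.sh`. `squeue` lists pending and running jobs, and `scancel <jobid>` removes a job from the queue.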
## FTP-A64 ARM batch system queues
| Queue | Node type(s) | Access policy | Minimum resources | Default resources | Maximum resources |
|---|---|---|---|---|---|
| a64fx | A64FX | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=48, mem=28000mb | time=24:00:00, nodes=8, ntasks=48, mem=28000mb |
| nvidia100_2 | ARM-A100 | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=80, mem-per-cpu=6350mb | time=24:00:00, nodes=4, ntasks=80, mem=522400mb |
| dual_a_max | Dual ARM Altra Max | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=256, mem-per-cpu=2035mb | time=24:00:00, nodes=6, ntasks=256, mem=520960mb |
| grace_grace | Grace-Grace | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=144, mem-per-cpu=3402mb | time=24:00:00, nodes=6, ntasks=144, mem=489960mb |
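All ARM queues are exclusive, so a job always receives whole nodes. The sketch below requests an interactive shell on one Grace-Grace node; the partition name is taken from the table above, while the time limit and task count are only examples.

```bash
# Allocate one Grace-Grace node for 30 minutes (partition name from the table above)
salloc --partition=grace_grace --nodes=1 --ntasks=144 --time=00:30:00
# Inside the allocation, start an interactive shell on the compute node
srun --pty bash
```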
## FTP-X86 batch system queues
| Queue | Node type(s) | Access policy | Minimum resources | Default resources | Maximum resources |
|---|---|---|---|---|---|
| intel-clv100 | Cascade Lake + NVIDIA V100 | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=80, mem-per-cpu=192000mb | time=24:00:00, nodes=4, ntasks=80, mem=192000mb |
| amd-milan-mi100 | AMD Milan + MI100 | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8025mb | time=24:00:00, nodes=1, ntasks=64, mem=513600mb |
| amd-milan-mi100 | AMD Milan + MI210 | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8025mb | time=24:00:00, nodes=1, ntasks=64, mem=513600mb |
| amd-milan-mi250 | AMD Milan + MI250 | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=1, mem=8025mb | time=24:00:00, nodes=2, ntasks=256, mem=1027200mb |
| amd-milan-graphcore | AMD Milan + Graphcore | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8025mb | time=24:00:00, nodes=1, ntasks=128, mem=513600mb |
| intel-spr | Intel Sapphire Rapids | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=6420mb | time=24:00:00, nodes=2, ntasks=80, mem=513600mb |
| intel-spr-hbm | Intel Sapphire Rapids + HBM | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8075mb | time=24:00:00, nodes=2, ntasks=80, mem=646000mb |
| intel-spr-pvc | Intel Sapphire Rapids + Ponte Vecchio | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=1, mem=6420mb | time=24:00:00, nodes=1, ntasks=80, mem=513600mb |
| amd-milan-mi300 | AMD Instinct MI300A Accelerators | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=5356mb | time=24:00:00, nodes=1, ntasks=192, mem=514176mb |
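In contrast to the ARM queues, most x86 queues are shared, so a job should request only the cores and memory it actually needs. The sketch below asks for a fraction of an `intel-spr` node; the values are illustrative, and accelerator queues will usually need an additional GPU request (typically via `--gres`), whose exact form is site-specific.

```bash
#!/bin/bash
#SBATCH --partition=intel-spr   # shared queue from the table above
#SBATCH --nodes=1
#SBATCH --ntasks=8              # request only part of the node
#SBATCH --mem=51360mb           # illustrative memory request
#SBATCH --time=02:00:00
## On accelerator queues a GPU request is usually added as well, e.g.:
##SBATCH --gres=gpu:1           # exact GRES name is site-specific

srun ./my_app
```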