# Batch system
As described in the Hardware Overview chapter, users only have direct access to the two login nodes of the Future Technologies Partition (FTP). Access to the compute nodes is only possible through the batch system. The batch system on FTP is Slurm.
Slurm is an open-source, fault-tolerant, and highly scalable job scheduling system for large and small Linux clusters. Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Any calculation on the compute nodes requires the user to define the work as a single command or a sequence of commands, together with the required run time, number of CPU cores, and amount of main memory, and to submit this batch job to the resource and workload manager. All job submissions are therefore performed with Slurm commands. Slurm queues and runs user jobs based on fair-share policies. A minimal batch script is sketched below.
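The following sketch shows the typical workflow. The queue name is taken from the tables below; the resource values, job name, and application binary (`./my_app`) are placeholders and must be adapted to your own job.

```bash
#!/bin/bash
#SBATCH --partition=a64fx      # target queue (see the queue tables below)
#SBATCH --nodes=1              # number of compute nodes
#SBATCH --ntasks=48            # number of (MPI) tasks
#SBATCH --time=01:00:00        # requested wall-clock time
#SBATCH --mem=28000mb          # requested main memory per node
#SBATCH --job-name=my_job      # placeholder job name

# Start the (parallel) application on the allocated resources.
srun ./my_app
```

The script is submitted with `sbatch job.sh`. `squeue` lists pending and running jobs, and `scancel <jobid>` removes a job from the queue.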
## FTP-A64 ARM batch system queues
| Queue | Node type(s) | Access policy | Minimum resources | Default resources | Maximum resources |
|---|---|---|---|---|---|
| a64fx | A64FX | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=48, mem=28000mb | time=24:00:00, nodes=8, ntasks=48, mem=28000mb |
| nvidia100_2 | ARM-A100 | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=80, mem-per-cpu=6350mb | time=24:00:00, nodes=4, ntasks=80, mem=522400mb |
| dual_a_max | Dual ARM Altra Max | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=256, mem-per-cpu=2035mb | time=24:00:00, nodes=6, ntasks=256, mem=520960mb |
| grace_grace | Grace-Grace | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=144, mem-per-cpu=3402mb | time=24:00:00, nodes=6, ntasks=144, mem=489960mb |
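All ARM queues are exclusive, so a job always receives whole nodes. The sketch below requests an interactive shell on one Grace-Grace node; the partition name is taken from the table above, while the time limit and task count are only examples.

```bash
# Allocate one Grace-Grace node for 30 minutes (partition name from the table above)
salloc --partition=grace_grace --nodes=1 --ntasks=144 --time=00:30:00
# Inside the allocation, start an interactive shell on the compute node
srun --pty bash
```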
## FTP-X86 batch system queues
| Queue | Node type(s) | Access policy | Minimum resources | Default resources | Maximum resources |
|---|---|---|---|---|---|
| intel-clv100 | Cascade Lake + NVIDIA V100 | Exclusive | nodes=1, ntasks=1 | time=00:30:00, ntasks=80, mem-per-cpu=192000mb | time=24:00:00, nodes=4, ntasks=80, mem=192000mb |
| amd-milan-mi100 | AMD Milan + MI100 | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8025mb | time=24:00:00, nodes=1, ntasks=64, mem=513600mb |
| amd-milan-mi100 | AMD Milan + MI210 | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8025mb | time=24:00:00, nodes=1, ntasks=64, mem=513600mb |
| amd-milan-mi250 | AMD Milan + MI250 | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=1, mem=8025mb | time=24:00:00, nodes=2, ntasks=256, mem=1027200mb |
| amd-milan-graphcore | AMD Milan + Graphcore | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8025mb | time=24:00:00, nodes=1, ntasks=128, mem=513600mb |
| intel-spr | Intel Sapphire Rapids | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=6420mb | time=24:00:00, nodes=2, ntasks=80, mem=513600mb |
| intel-spr-hbm | Intel Sapphire Rapids + HBM | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=8075mb | time=24:00:00, nodes=2, ntasks=80, mem=646000mb |
| intel-spr-pvc | Intel Sapphire Rapids + Ponte Vecchio | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=1, mem=6420mb | time=24:00:00, nodes=1, ntasks=80, mem=513600mb |
| amd-milan-mi300 | AMD Instinct MI300A Accelerators | Shared | nodes=1, ntasks=1 | time=00:30:00, ntasks=2, mem=5356mb | time=24:00:00, nodes=1, ntasks=192, mem=514176mb |
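In contrast to the ARM queues, most x86 queues are shared, so a job should request only the cores and memory it actually needs. The sketch below asks for a fraction of an `intel-spr` node; the values are illustrative, and accelerator queues will usually need an additional GPU request (typically via `--gres`), whose exact form is site-specific.

```bash
#!/bin/bash
#SBATCH --partition=intel-spr   # shared queue from the table above
#SBATCH --nodes=1
#SBATCH --ntasks=8              # request only part of the node
#SBATCH --mem=51360mb           # illustrative memory request
#SBATCH --time=02:00:00
## On accelerator queues a GPU request is usually added as well, e.g.:
##SBATCH --gres=gpu:1           # exact GRES name is site-specific

srun ./my_app
```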