Maintenance¶
Maintenance 02.12. - 03.12.2024¶
AFFECTED: SVC: National HPC (Tier-2)
IMPACT: Whole maintenance period
Reason: Function change (update or upgrade)
DESCRIPTION: On 03.12. and 04.12.2024, conversion work will be carried out on the hot water cooling circuit in building 449.3. For this reason, HoreKa and HAICORE will not be available on these days (03.12. from 8:00 a.m.).
As soon as the work is completed, we will reactivate HoreKa and HAICORE for use; this is expected to happen in the course of 04.12.2024. We will use the downtime to install updates.
Maintenance 15.04. - 19.04.2024¶
The following changes were performed during maintenance:
- All firmware versions on all components were upgraded
- The operating system was upgraded to Red Hat Enterprise Linux (RHEL) 8.8. We recommend recompiling all applications after the upgrade.
- The Mellanox OFED InfiniBand stack was upgraded
- Slurm was upgraded
- File system clients and servers were updated
- Compiler and MPI versions and the software modules built against them were updated. Modules of deprecated versions will be removed.
Maintenance 26.10.2023¶
Due to parametrization work on the infrastructure of the HoreKa compute centre and scheduled maintenance, the cluster will not be available on 26.10.2023 from 8:30 to 19:00.
The following changes will be performed during the maintenance:
- Slurm will be upgraded to version 23.02.6
  - In Slurm version 23.02, --ntasks-per-core applies to job and step allocations. If set to 1, it now implies --cpu-bind=cores; if set to a value greater than 1, it implies --cpu-bind=threads. For jobs that use Intel MPI together with the Slurm option --ntasks-per-core, you will need to export SLURM_CPU_BIND=NONE in the job environment.
  - Changed task_prolog.hk -> task_prolog
- NVIDIA driver will be upgraded to the most recent version (535.104.12 or higher)
- The bandwidth to LSDF online storage will be increased
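The Slurm note above on --ntasks-per-core and Intel MPI can be illustrated with a minimal job script sketch. The resource values and the application name ./my_mpi_app are hypothetical placeholders, and the script assumes that an Intel MPI environment has already been loaded; adapt it to your own job.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1   # with Slurm 23.02 this implies --cpu-bind=cores
#SBATCH --time=00:10:00       # hypothetical time limit

# Disable the CPU binding that Slurm now applies implicitly,
# as required for Intel MPI jobs that use --ntasks-per-core.
export SLURM_CPU_BIND=NONE

# ./my_mpi_app is a hypothetical placeholder for your own MPI program.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_mpi_app ]; then
    mpirun ./my_mpi_app
else
    echo "mpirun or ./my_mpi_app not found; nothing launched"
fi
```

The export has to happen before the MPI launch line so that the launched tasks inherit the variable.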
Security Update 10.08.2023¶
On 10.08.2023, a short interruption of regular operation took place to address multiple security vulnerabilities in Intel and AMD microarchitectures. A malicious actor may use these vulnerabilities to gain unauthorized access to the contents of the vector registers, potentially leaking sensitive information.
To mitigate these vulnerabilities, new versions of the Intel and AMD microcode were installed, and the affected nodes were rebooted over the following weekend.
As a result of the microcode update, a performance drop of 5% to 10% under normal workloads might be observed on Intel platforms. This is because the update restricts the execution of the gather instruction provided by the Intel Advanced Vector Extensions 2 (Intel AVX2) and Intel Advanced Vector Extensions 512 (Intel AVX-512). For more information, please refer to the technical paper.
Maintenance 11.04. - 12.05.2023¶
To prepare the HPC data center at the Steinbuch Centre for Computing (SCC) at KIT for the installation of HoreKa Phase 2 and future HPC systems, extensive construction work had to be carried out on the building infrastructure. Additionally, the following changes were performed during the regular maintenance (17.04. - 21.04.2023):
- All firmware versions on all components were upgraded
- The operating system was upgraded to Red Hat Enterprise Linux (RHEL) 8.6. We recommend recompiling all applications after the upgrade.
- The Mellanox OFED InfiniBand stack was upgraded
- The NVIDIA driver was upgraded
- pigz and pbzip are not supported anymore. Please use pzstd instead.
- Slurm was upgraded to version 22.05.8
- File system clients (Spectrum Scale, Lustre and BeeGFS) were updated
- Spectrum Scale file system servers were updated
- The file systems home and work have been extended with 1 PB of fast NVMe SSD storage. No user action is required to use this new storage. New files will be automatically stored on the SSDs. Old and large files will be transparently migrated from the SSDs to the slower HDDs if the disk space on the SSDs fills up.
- Singularity was replaced with its successor Apptainer; Enroot was upgraded
- Compiler and MPI versions and the software modules built against them were updated. Modules of deprecated versions were removed. Some additional modules will be added later on.
- After the maintenance, the following per-user limits apply (via cgroups) on the login node: 48 GB physical memory, 400% CPU cycles (100% equals 1 thread)
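For users moving from pigz or pbzip to pzstd as recommended in the list above, the following is a minimal sketch of a compress and decompress round trip. The file names are hypothetical, and the thread count of 4 is chosen only to stay within the 400% CPU per-user limit on the login node; the block skips the compression step if pzstd is not in the PATH.

```shell
#!/bin/bash
# Create a small sample file to compress (hypothetical data).
printf 'line %s\n' 1 2 3 4 5 > sample.txt

if command -v pzstd >/dev/null 2>&1; then
    # Compress with 4 threads; writes sample.txt.zst and keeps sample.txt.
    pzstd -f -p 4 sample.txt
    # Decompress into a separate file and compare it with the original.
    pzstd -d -f -o restored.txt sample.txt.zst
    cmp sample.txt restored.txt && echo "round trip OK"
else
    echo "pzstd not found in PATH"
fi
```

Files compressed this way carry the .zst suffix and cannot be read by gzip or bzip2 tools, so existing archives still need to be decompressed with the tool that created them.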
Maintenance 12.07. - 16.07.2021¶
From July 12th 9:00 am until July 16th noon, no compute nodes will be available on HoreKa and HAICORE, so no jobs will run. Additionally, individual login nodes will be unavailable for some time during this interval, which will also affect the Jupyter and CI services.