Maintenance¶
Maintenance 15.04. - 19.04.2024¶
The following changes were performed during maintenance:
- All firmware versions on all components were upgraded.
- The operating system was upgraded to Red Hat Enterprise Linux (RHEL) 8.8. We recommend recompiling all applications after the upgrade.
- The Mellanox OFED InfiniBand stack was upgraded.
- Slurm was upgraded.
- File system clients and servers were updated.
- Compiler and MPI versions and the software modules built against them were updated. Modules of deprecated versions will be removed.
- The LSDF is no longer mounted inside a job unless the Slurm option -C LSDF is specified.
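The LSDF item above can be illustrated with a minimal job script sketch. It assumes that --constraint (the long form of Slurm's -C) is placed in the job header; the path in LSDF_DIR is a hypothetical placeholder and check_mount is an illustrative helper, neither of which comes from the HoreKa documentation.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --constraint=LSDF   # long form of -C LSDF; requests the LSDF mount for this job

# Hypothetical placeholder: replace with the path of your own LSDF directory.
LSDF_DIR="${LSDF_DIR:-/path/to/your/lsdf/project}"

# Report whether a directory is visible from where the script runs.
check_mount() {
    if [ -d "$1" ]; then
        echo "available: $1"
    else
        echo "not available: $1"
    fi
}

check_mount "$LSDF_DIR"
```

Submitting the same script without the constraint line should then report the directory as not available.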
Maintenance 26.10.2023¶
Due to parameterization work on the infrastructure of the HoreKa compute centre and regular maintenance, the cluster will not be available on 26.10.2023 from 8:30 to 19:00.
The following changes will be performed during the maintenance:
- Slurm will be upgraded to version 23.02.6.
  - In Slurm version 23.02, --ntasks-per-core applies to job and step allocations. If set to 1, it now implies --cpu-bind=cores; if set to a value greater than 1, it implies --cpu-bind=threads. Jobs that use Intel MPI together with the Slurm option --ntasks-per-core need to export SLURM_CPU_BIND=NONE in the job environment.
  - task_prolog.hk will be renamed to task_prolog.
- The NVIDIA driver will be upgraded to the most recent version (535.104.12 or higher).
- The bandwidth to the LSDF online storage will be increased.
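The Intel MPI workaround described in the Slurm item above might look as follows in a job script. This is only a sketch: the resource values, the executable name my_mpi_app and the use of mpirun as the launcher are assumptions for illustration.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --time=00:10:00

# With Slurm 23.02, --ntasks-per-core implies a CPU binding.
# For Intel MPI jobs, switch the implied binding off again.
export SLURM_CPU_BIND=NONE
echo "SLURM_CPU_BIND=$SLURM_CPU_BIND"

# Hypothetical application; the launch is skipped if the file does not exist.
if [ -x ./my_mpi_app ]; then
    mpirun ./my_mpi_app
fi
```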
Security Update 10.08.2023¶
On 10.08.2023 a short interruption of regular operation took place to address multiple security vulnerabilities in Intel and AMD microarchitectures. A malicious actor may use these vulnerabilities to gain unauthorized access to the contents of the vector registers, potentially leaking sensitive information.
To mitigate these vulnerabilities, new versions of the Intel and AMD microcode were installed, and the affected nodes were rebooted over the following weekend.
As a result of the microcode update, a performance drop of 5% to 10% under normal workloads might be observed on Intel platforms. This is because the update restricts the execution of the gather instruction provided by the Intel Advanced Vector Extensions 2 (Intel AVX2) and Intel Advanced Vector Extensions 512 (Intel AVX-512). For more information, please refer to the technical paper.
Maintenance 11.04. - 12.05.2023¶
To prepare the HPC data center at the Steinbuch Centre for Computing (SCC) at KIT for the installation of HoreKa Phase 2 and future HPC systems, extensive construction work had to be carried out on the building infrastructure. Additionally, the following changes were performed during the regular maintenance (17.04. - 21.04.2023):
- All firmware versions on all components were upgraded.
- The operating system was upgraded to Red Hat Enterprise Linux (RHEL) 8.6. We recommend recompiling all applications after the upgrade.
- The Mellanox OFED InfiniBand stack was upgraded.
- The NVIDIA driver was upgraded.
- pigz and pbzip are no longer supported. Please use pzstd instead.
- Slurm was upgraded to version 22.05.8.
- File system clients (Spectrum Scale, Lustre and BeeGFS) were updated.
- Spectrum Scale file system servers were updated.
- The file systems home and work have been extended with 1 PB of fast NVMe SSD storage. No user action is required to use this new storage. New files will be automatically stored on the SSDs. Old and large files will be transparently migrated from the SSDs to the slower HDDs if the disk space on the SSDs fills up.
- Singularity was replaced with its successor Apptainer, and Enroot was upgraded.
- Compiler and MPI versions and the software modules built against them were updated. Modules of deprecated versions were removed. Some additional modules will be added later on.
- After the maintenance, the following per-user limits apply (via cgroups) on login nodes: 48 GB physical memory, 400% CPU cycles (100% equals 1 thread).
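For users switching from pigz or pbzip to pzstd, as recommended in the list above, a minimal sketch of compressing and decompressing a single file could look like this. The helper names compress_file and decompress_file and the default thread count of 4 are assumptions made for this example.

```shell
#!/bin/bash

# Compress a file with pzstd using the given number of threads.
# The result is written next to the input as <file>.zst.
compress_file() {
    local file="$1" threads="${2:-4}"
    command -v pzstd >/dev/null 2>&1 || { echo "pzstd not found" >&2; return 1; }
    pzstd -p "$threads" -f "$file"
}

# Decompress a .zst archive with pzstd using the given number of threads.
decompress_file() {
    local archive="$1" threads="${2:-4}"
    command -v pzstd >/dev/null 2>&1 || { echo "pzstd not found" >&2; return 1; }
    pzstd -d -p "$threads" -f "$archive"
}
```

A call such as compress_file results.dat 8 would then produce results.dat.zst using eight threads; on login nodes, the per-user CPU limit listed above should be kept in mind when choosing the thread count.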
Maintenance 19.04. - 26.04.2022¶
The following changes have been performed during the maintenance:
- All firmware versions on all components have been upgraded.
- The operating system version is now based on Red Hat Enterprise Linux (RHEL) 8.4. We recommend recompiling all applications after the upgrade.
- The Mellanox OFED InfiniBand stack has been upgraded.
- The obsolete Intel compiler version 18.0 has been removed. The officially supported Intel compiler versions are now 19.0, 19.1 and 2021.4.0 (oneAPI).
- LLVM version 14 has been added. Older LLVM modules have been removed.
- OpenMPI 4.0 and 4.1 have been updated to the latest patch level. OpenMPI 3.0 has been removed.
- Many software modules have been updated and built against the new compiler and MPI versions.
- The system Python version 3.9 has been added. If no other Python module is loaded, the command python3.9 defaults to version 3.9.2, the command python3.8 defaults to version 3.8.6, the commands python3 and python default to version 3.6.8 and the command python2 defaults to version 2.7.18.
- The hpc-workspace tools have been updated to version 1.3.7.
- The Lmod module system has been upgraded.
- cmake 3.23 has been added.
- Slurm has been upgraded to version 21.08.7.
- HKFS storage: new controller firmware has been installed.
- The Spectrum Scale, Lustre and BeeGFS file system clients have been updated.
- The NVIDIA driver has been upgraded to version 510.47.03. CUDA version 11.6 has been added.
- Enroot has been updated to 3.4.0.
- Singularity has been updated to 3.8.7.
- JupyterHub has been upgraded to version 2.2.2.
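To see which interpreter each of the Python commands named in the list above resolves to in the current environment, a small check along the following lines may help. The helper name report_version is an assumption for this sketch; the output depends on which modules are loaded.

```shell
#!/bin/bash

# Print the version reported by a command, or note that it is missing.
# Stderr is merged in because Python 2 prints its version there.
report_version() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $("$1" --version 2>&1)"
    else
        echo "$1: not found"
    fi
}

for cmd in python python2 python3 python3.8 python3.9; do
    report_version "$cmd"
done
```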
29.11.2021 Change to Memory Management settings¶
On 29.11.2021 we will change two operating system settings affecting memory management on HoreKa. We expect that these changes can make your applications run faster.
Enablement of "Transparent Huge Pages"¶
The size of memory pages will be increased. This decreases the memory management overhead for many memory access patterns, resulting in a speedup.
More information can be found here. The value will be set to "always".
Activation of "zone_reclaim_mode"¶
More intensive attempts than before are made to create new memory pages in the NUMA-domain (CPU socket) in which the associated process is running. This avoids (slower) memory accesses to the memory of other CPU sockets.
More information can be found here. The value will be set to "1" (Zone reclaim on).
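Both settings can be inspected on a node through the standard Linux kernel interfaces. The following sketch prints them; the helper name show_setting is an assumption for this example. For Transparent Huge Pages, the active value is the one shown in square brackets.

```shell
#!/bin/bash

# Print the content of a kernel settings file, or note that it is unreadable.
show_setting() {
    if [ -r "$1" ]; then
        echo "$1: $(cat "$1")"
    else
        echo "$1: not readable"
    fi
}

show_setting /sys/kernel/mm/transparent_hugepage/enabled
show_setting /proc/sys/vm/zone_reclaim_mode
```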
Maintenance 12.07. - 16.07.2021¶
From July 12th 9:00 am until July 16th noon no compute nodes will be available on HoreKa and HAICORE, so no jobs will run. Additionally, individual login nodes will be unavailable for some time during this interval, which will also affect the Jupyter and CI services.
Maintenance 28.05.2021¶
HoreKa will go into full operation on 01.06.2021 as planned. To perform the last preparatory steps, a planned maintenance interval took place on Friday, 28.05.2021 between 08:00 and 15:00.
The login nodes have been reinstalled and restarted. Running jobs have not been aborted. Waiting jobs started automatically after the maintenance was completed. Data on the parallel file systems was preserved; data stored locally on the nodes (e.g. in /tmp) has been deleted.
Please note that the following major changes have taken place:
- The configuration of the batch system partitions has been adjusted. In particular, the memory limit for some node types in the queue "cpuonly" has been reduced to 1600 MB per CPU. Please check the updated information and adjust your job scripts if necessary.
- The Python modules devel/python/2.7 and devel/python/3.6 have been renamed to devel/python/2.7_intel and devel/python/3.6_intel. Please note that you do not need to load a module to use the default system Python version 3.6.8.
- Python version 3.8 has been made available in addition to the default system version 3.6.8. You can either use the python3.8 and pip3.8 commands to explicitly request this version, or load the devel/python/3.8 module, which overrides the python and pip commands.
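As an illustration of adjusting a job script to the per-CPU memory limit mentioned in the first item above, the following sketch requests 1600 MB per CPU in the "cpuonly" queue and prints the resulting total. The task and CPU counts and the helper total_mem_mb are assumptions for this example, and the limit applies only to some node types, so the updated partition information remains the reference.

```shell
#!/bin/bash
#SBATCH --partition=cpuonly
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1600   # in MB, Slurm's default unit for this option
#SBATCH --time=00:10:00

# Total memory implied by a per-CPU value:
# tasks * CPUs per task * MB per CPU.
total_mem_mb() {
    echo $(( $1 * $2 * $3 ))
}

echo "Requested memory: $(total_mem_mb 4 2 1600) MB"
```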