Maintenance¶
Maintenance 02.12. - 03.12.2024¶
AFFECTED: SVC: Nationales HPC (Tier-2)
IMPACT: Whole maintenance period
Reason: Function change (update or upgrade)
DESCRIPTION: On 03.12. and 04.12.2024, conversion work will be carried out on the hot water cooling circuit in building 449.3. For this reason, HoreKa and HAICORE will not be available on these days (03.12. from 8:00 a.m.).
As soon as the work is completed, we will reactivate HoreKa and HAICORE for use; this is expected to happen in the course of 04.12.2024. We will use the downtime to install updates.
Maintenance 15.04. - 19.04.2024¶
The following changes were performed during the maintenance:
-
All firmware versions on all components were upgraded
-
The operating system of FTP-X86 was upgraded to Red Hat Enterprise Linux (RHEL) 8.8. We recommend recompiling all applications after the upgrade.
-
The Mellanox OFED InfiniBand stack was upgraded
-
Slurm was upgraded
-
File system clients and servers on FTP-X86 were updated
-
Compiler and MPI versions and the software modules built against them were updated. Modules of deprecated versions were removed.
Security Update 10.08.2023¶
On 10.08.2023, regular operation was briefly interrupted to address multiple security vulnerabilities in Intel and AMD microarchitectures. A malicious actor may use these vulnerabilities to gain unauthorized access to the contents of the vector registers, potentially leaking sensitive information.
To mitigate these vulnerabilities, new versions of the Intel and AMD microcode were installed, and the affected nodes were rebooted over the following weekend.
As a result of the microcode update, a performance drop of 5% to 10% under normal workloads might be observed on Intel platforms. This is because the update restricts the execution of the gather instructions provided by the Intel Advanced Vector Extensions 2 (Intel AVX2) and Intel Advanced Vector Extensions 512 (Intel AVX-512). For more information, please refer to the technical paper.
Maintenance 11.04. - 12.05.2023¶
To prepare the HPC data center at the Steinbuch Centre for Computing (SCC) at KIT for the installation of HoreKa Phase 2 and future HPC systems, extensive construction work had to be carried out on the building infrastructure. Additionally, the following changes were performed during the regular maintenance (17.04. - 21.04.2023):
-
All firmware versions on all FTP-X86 components were upgraded. The firmware on FTP-A64 components will be upgraded during the May downtime.
-
The operating system of FTP-X86 was upgraded to Red Hat Enterprise Linux (RHEL) 8.6. We recommend recompiling all applications after the upgrade.
-
The Mellanox OFED InfiniBand stack was upgraded
-
The NVIDIA driver was upgraded
-
pigz and pbzip are no longer supported. Please use pzstd instead.
-
Slurm was upgraded to version 22.05.8
-
File system clients (Spectrum Scale, Lustre and BeeGFS) on FTP-X86 were updated
-
Spectrum Scale file system servers were updated
-
The file systems home and work have been extended with 1 PB fast NVMe SSD storage. No user action is required to use this new storage. New files will be automatically stored on the SSDs. Old and large files will be transparently migrated from the SSDs to the slower HDDs if the disk space on the SSDs fills up.
-
Singularity was replaced with its successor Apptainer, and Enroot was upgraded.
-
Compiler and MPI versions and the software modules built against them were updated. Modules of deprecated versions were removed. Some additional modules will be added later on.
-
After the maintenance, the following per-user limits apply (via cgroups) on login nodes: 48 GB physical memory, 400% CPU cycles (100% equals 1 thread)
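The CPU limit on login nodes translates directly into a thread count: 400% corresponds to four fully busy threads. The following is a minimal Python sketch of how a script run on a login node might cap its own parallelism to stay within these limits. The two constants come from the limits stated above; the helper name `login_node_workers` and the per-worker memory estimate are hypothetical and not part of any cluster tooling.

```python
# Per-user limits on login nodes, as stated in the maintenance notes.
CPU_LIMIT_PERCENT = 400   # 100% equals 1 thread
MEMORY_LIMIT_GB = 48      # physical memory


def login_node_workers(requested: int, mem_per_worker_gb: float) -> int:
    """Return a worker count that fits within the per-user limits.

    Hypothetical helper for illustration only; mem_per_worker_gb is the
    caller's own estimate of the memory each worker needs.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if not 0 < mem_per_worker_gb <= MEMORY_LIMIT_GB:
        raise ValueError("a single worker must fit into the memory limit")
    max_by_cpu = CPU_LIMIT_PERCENT // 100                    # 4 threads
    max_by_mem = int(MEMORY_LIMIT_GB // mem_per_worker_gb)   # whole workers only
    return min(requested, max_by_cpu, max_by_mem)


# 16 workers requested at 2 GB each: the CPU limit is the bottleneck.
workers = login_node_workers(requested=16, mem_per_worker_gb=2)  # -> 4
```

Anything that needs more than this is better submitted as a batch job than run on a login node.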
Maintenance FTP-a64 29.11. - 30.11.2022¶
The following changes have been performed during the maintenance:
-
There is a new dedicated login node. From now on you have to use the DNS name ftp-a64-login.scc.kit.edu to log into this cluster.
-
The operating system on the nodes is Rocky Linux 8.6.
-
The module system and other helper scripts are similar to those on HoreKa.
-
In addition to the Fujitsu A64FX nodes, there are 4 nodes that each contain 2 NVIDIA A100 accelerators and an Ampere Altra Q80-32 host CPU.
-
The login node also has an Ampere Altra Q80-32 host CPU, so one can compile and run software there. The software modules are available on the login node.
-
The following Slurm partitions are available: nvidia100_2, a64fx
Maintenance FTP-x86 19.04. - 26.04.2022¶
The following changes have been performed during the maintenance:
-
All firmware versions on all components have been upgraded
-
The operating system version is now based on Red Hat Enterprise Linux (RHEL) 8.4. We recommend recompiling all applications after the upgrade.
-
The Mellanox OFED InfiniBand stack has been upgraded.
-
The obsolete Intel compiler version 18.0 has been removed. The officially supported Intel compiler versions are now 19.0, 19.1 and 2021.4.0 (oneAPI).
-
LLVM version 14 was added. Older LLVM modules have been removed.
-
OpenMPI 4.0 and 4.1 have been updated to the latest patch level. OpenMPI 3.0 has been removed.
-
Many software modules have been updated and built against the new compiler and MPI versions
-
The system Python version 3.9 was added. If no other Python module is loaded, the command python3.9 defaults to version 3.9.2, the command python3.8 defaults to version 3.8.6, the commands python3 and python default to version 3.6.8 and the command python2 defaults to version 2.7.18.
-
The hpc-workspace tools have been updated to version 1.3.7.
-
The Lmod module system has been upgraded.
-
cmake 3.23 has been added.
-
Slurm has been upgraded to version 21.08.7.
-
HKFS Storage: new controller firmware
-
The Spectrum Scale, Lustre and BeeGFS file system clients were updated
-
The NVIDIA driver has been upgraded to version 510.47.03. CUDA version 11.6 has been added.
-
Enroot has been updated to 3.4.0.
-
Singularity has been updated to 3.8.7.
-
JupyterHub has been upgraded to version 2.2.2.
Maintenance 28.09. - 30.09.2021¶
The following extensive changes have been performed:
-
A new Graphcore IPU-POD16 system has been installed as part of the FTP-X86 cluster.
-
The FTP-X86n2 node (Cascade Lake + 1x V100) has been converted into a login node, removing the login node role from the FTP-X86 head node. From now on you have to use the DNS name ftp-x86-login.scc.kit.edu to log into this cluster. Please note that your SSH client will likely show a warning because the IP address of a known server has changed.
-
The NVIDIA V100 GPUs have been removed from the FTP-X86n[1,2] nodes and put into the FTP-X86n[3,4] nodes, turning these two nodes into 2x GPU nodes.
-
The InfinityFabric bridges necessary for fast inter-GPU communication have been installed in the FTP-X86n[5,6] nodes.
-
The FTP-A64 cluster has been configured to use the HoreKa file systems for $HOME and Workspaces, just like the FTP-X86 cluster already does. The data previously residing in /home on the FTP-A64 nodes is still available in the path /mnt/oldhomes/, so users can migrate it on their own.
-
The ROCm software stack has been updated to version 4.3.1.
-
The firmware of many components has been updated.
Maintenance 09.09.2021¶
The FTP-X86n[5,6] nodes are now equipped with significantly more powerful AMD EPYC 7543 "Milan" processors. The new CPUs have 32 instead of 16 cores per socket and can execute a total of 128 threads. In addition, the new microarchitecture ("Milan" generation) achieves up to 20% higher performance per core. The distribution of the four GPUs across the two CPU sockets in the nodes has also been optimized during the maintenance.
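The thread count quoted above follows from simple multiplication. The sketch below reproduces it in Python, assuming simultaneous multithreading with two hardware threads per core; that factor is not stated in the text but is what makes the stated figures add up. The function name is hypothetical.

```python
SOCKETS_PER_NODE = 2    # "the two CPU sockets in the nodes"
THREADS_PER_CORE = 2    # assumption: SMT with 2 hardware threads per core


def hardware_threads(cores_per_socket: int) -> int:
    """Total hardware threads per node for a given core count per socket."""
    return SOCKETS_PER_NODE * cores_per_socket * THREADS_PER_CORE


new_total = hardware_threads(32)   # 2 * 32 * 2 = 128 threads after the upgrade
old_total = hardware_threads(16)   # 2 * 16 * 2 = 64 threads before the upgrade
```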
The batch system partition amd-rome-mi100 has been renamed to amd-milan-mi100 to reflect the upgrade.
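Job scripts that still request the old partition name need to be adjusted. The following is a minimal Python sketch of one way to do that, assuming the partition is requested in `#SBATCH` directive lines (for example via `--partition=` or `-p`). The function name `update_partition` is hypothetical, and the helper only transforms text; reading and writing the script files is left to the caller.

```python
import re

OLD_PARTITION = "amd-rome-mi100"
NEW_PARTITION = "amd-milan-mi100"

# Match the old name only as a complete token, so that longer names which
# merely start with it are left untouched.
_PATTERN = re.compile(r"(?<![\w-])" + re.escape(OLD_PARTITION) + r"(?![\w-])")


def update_partition(script_text: str) -> str:
    """Replace the old partition name in #SBATCH lines only.

    Hypothetical helper for illustration; all other lines are returned
    unchanged.
    """
    updated = []
    for line in script_text.splitlines(keepends=True):
        if line.lstrip().startswith("#SBATCH"):
            line = _PATTERN.sub(NEW_PARTITION, line)
        updated.append(line)
    return "".join(updated)


example = "#!/bin/bash\n#SBATCH --partition=amd-rome-mi100\n#SBATCH --nodes=1\n"
converted = update_partition(example)
```

Restricting the replacement to `#SBATCH` lines is a deliberately conservative choice: comments or log messages that mention the old name elsewhere in a script are not rewritten.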