File Systems¶

A central aspect in the design of HAICORE has been the enormous amount of data generated by scientific research projects. A multi-level data storage concept guarantees high-throughput processing of data using several different storage systems.

At the core of this design are two large-scale parallel file systems based on IBM Spectrum Scale (also known as GPFS), which hold globally visible user data. A home directory is automatically created for each user on the Spectrum Scale home file system, and the environment variable $HOME points to this directory. Each user can also create so-called workspaces on the Spectrum Scale work file system.

Other storage locations include a temporary directory called $TMPDIR, which is located on the local solid state disks (SSDs) of a node and is therefore only visible on that individual node while a job is running. To create a temporary directory that is visible on all nodes of a batch job, users can request a temporary BeeGFS On Demand (BeeOND) file system. Access to BeeOND file systems is only possible from the nodes of the batch job and while the job is running.

Users with access to the LSDF can also use the corresponding Spectrum Scale file systems.

The characteristics of the file systems are shown in the following table.

Property	$HOME	workspace	$TMPDIR	BeeOND
Visibility	global	global	local	job local
Lifetime	permanent	limited	job walltime	job walltime
Disk space	2.5 PB	13.5 PB	800 GB	n * 750 GB
Quotas	yes	yes	no	no
Snapshot	yes	yes	no	no
Backup	yes	no	no	no
Total read perf	84 GB/s	196 GB/s	750 MB/s	n * 700 MB/s
Total write perf	60 GB/s	140 GB/s	750 MB/s	n * 700 MB/s
Read perf/node	10 GB/s	10 GB/s	750 MB/s	10 GB/s
Write perf/node	10 GB/s	10 GB/s	750 MB/s	10 GB/s

global: all nodes see the same file system
local: each node has its own local file system
job local: only available within the currently running job
permanent: data is stored permanently (across job runs and reboots)
limited: data is stored across job runs and reboots, but will be deleted at some time
job walltime: files are removed at the end of the batch job
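As a worked example for the n-dependent BeeOND entries in the table, the following sketch computes the aggregate capacity and throughput of a BeeOND file system. The function name beeond_estimate is hypothetical, and the sketch assumes that n in the table is the number of nodes of the batch job. The per-node figures (750 GB, 700 MB/s) are taken from the table.

```shell
# Hypothetical helper: estimate BeeOND size and aggregate throughput.
# Assumes "n" in the table is the number of nodes of the batch job.
beeond_estimate() {
    nodes=$1
    capacity_gb=$((nodes * 750))      # n * 750 GB disk space
    throughput_mbs=$((nodes * 700))   # n * 700 MB/s total read or write
    echo "${nodes} nodes: ${capacity_gb} GB, ${throughput_mbs} MB/s"
}

beeond_estimate 4
```

Under these assumptions a 4-node job would get roughly 3000 GB of BeeOND space.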

Selecting the appropriate file system¶

In general, you should separate your data and store it on the appropriate file system.

Permanently required data like software or important results should be stored below $HOME, but capacity limits (so-called "quotas") apply. Permanent data which is not needed for months or exceeds the capacity restrictions should be sent to external large scale storage systems and deleted from the home file system.

Temporary data which is only needed on a single node, which does not exceed the disk space shown in the table above and which is only needed during job runtime should be stored below $TMPDIR. Temporary data which is only needed during job runtime and which needs to be accessed from all nodes of a batch job should be stored on BeeOND. Scratch data which can be easily recomputed or which is the result of one job and input for another job should be stored below so-called workspaces. The lifetime of data in workspaces is limited and depends on the lifetime of the workspace.

Backups

If you accidentally deleted data on $HOME or on a workspace, you can usually copy back an older version from a so-called snapshot path. In addition, for $HOME there is also the possibility to restore files from a backup. Please see the Backup and Archival section for more information.

$HOME¶

For each user a fixed amount of disk space for the $HOME directory is reserved. The disk space is controlled by so-called quotas. The default quota limit per user is 1 TB and 2 million inodes.

Workspaces¶

Workspaces are directory trees which are available for a limited amount of time (a few months). The corresponding Spectrum Scale work file system has no backup, i.e. you should use workspaces for data which can be recreated, e.g. by running the same batch jobs once again. Recreating data should only be necessary in the very unlikely case that the file system becomes corrupted.

Initially, workspaces have a maximum lifetime of 60 days. You can extend the lifetime 3 times, each time by up to 60 days, but you should do this near the end of the lifetime, since the new lifetime starts when you execute the command that requests the extension.
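The effect of extending too early can be checked with a little arithmetic. The following sketch uses the limits stated above (60 days initially, 3 extensions of at most 60 days). The scenario of extending with 20 days remaining is an invented example.

```shell
# Maximum total lifetime if every extension is requested on the last day.
initial=60
extensions=3
per_extension=60
max_total=$((initial + extensions * per_extension))
echo "maximum total lifetime: ${max_total} days"

# Illustrative scenario: each extension is requested with 20 days still
# remaining. Those remaining days are lost each time, because the new
# lifetime starts when the extension command is executed.
remaining_at_extension=20
actual_total=$((max_total - extensions * remaining_at_extension))
echo "total when extending 20 days early: ${actual_total} days"
```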

For your account (user ID) there is a quota limit for all of your workspaces and for the expired workspaces (as long as they are not yet completely removed). The default quota limit per user is 250 TB and 50 million inodes.

Figure: Some of the disks of the HoreKa file system

Create workspace¶

To create a workspace you need to specify the ''name'' of your workspace and its ''lifetime'' in days. Note that the maximum value for ''lifetime'' is 60. Executing

$ ws_allocate blah 30

returns:

Info: creating workspace.
/hkfs/work/workspace_haic/scratch/USERNAME-blah
remaining extensions  : 3
remaining time in days: 30

For more information read the program's help, i.e. man ws_allocate.

List all your workspaces¶

To list all your workspaces, execute:

$ ws_list

which returns the following for each workspace:

Workspace ID
Workspace location
creation date, remaining time and expiration date
available extensions

Find workspace location¶

The location (path) of any workspace can be queried by its ''ID'' using ws_find. For the workspace ''blah'':

$ ws_find blah

returns the one-liner:

/hkfs/work/workspace_haic/scratch/USERNAME-blah
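In scripts it is often convenient to store the result of ws_find in a variable and to stop early if the workspace does not exist. The following is a minimal sketch. The function name workspace_path is hypothetical, and the sketch assumes that ws_find prints nothing usable (an empty string or a path that is not a directory) for an unknown workspace.

```shell
# Hypothetical wrapper: print the path of a workspace, or fail with a
# message if the workspace tools are missing or the workspace is unknown.
workspace_path() {
    if ! command -v ws_find >/dev/null 2>&1; then
        echo "ws_find not available on this machine" >&2
        return 1
    fi
    ws_path=$(ws_find "$1")
    if [ -z "$ws_path" ] || [ ! -d "$ws_path" ]; then
        echo "workspace '$1' not found" >&2
        return 1
    fi
    echo "$ws_path"
}

# Usage inside a script: do not continue with an empty path.
if WORKDIR=$(workspace_path blah); then
    echo "using workspace at $WORKDIR"
else
    echo "no workspace 'blah'; create it with ws_allocate first" >&2
fi
```

Checking the result avoids commands such as tar or rsync silently operating on a path like /dataset.tgz when the workspace has expired.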

Extend lifetime of your workspace¶

The lifetime of any workspace can be extended only three times. Two commands can be used to extend the lifetime:

ws_extend blah 40 which extends workspace ID ''blah'' by ''40'' days from now,
ws_allocate -x blah 40 which extends workspace ID ''blah'' by ''40'' days from now.

Delete a workspace¶

$ ws_release blah # Manually erase your workspace blah

Reminder for workspace deletion¶

By default you will get an email about an expiring workspace 7 days before a workspace expires. You can adapt this time by using the option ''-r '' of ws_allocate.

You can also send yourself a calendar entry which reminds you when a workspace will be automatically deleted:

$ ws_send_ical <workspace> <email>

Restoring expired Workspaces¶

At expiration time your workspace will be moved to a special, hidden directory. Expired workspaces are currently kept for 30 days. During this time you can still restore your data into a valid workspace. The same is true for released workspaces but they are only kept until the next night. In order to restore an expired workspace, use

ws_restore -l

to get a list of your expired workspaces, and then restore them into an existing, active workspace (here with name ''my_restored''):

ws_restore <full_name_of_expired_workspace> my_restored

Note: The expired workspace has to be specified using the full name as listed by ws_restore -l, including username prefix and timestamp suffix (otherwise, it cannot be uniquely identified). The target workspace, on the other hand, must be given with just its short name as listed by ws_list, without the username prefix.

Note: ws_restore only works within the same file system. You therefore have to ensure that the new workspace allocated with ws_allocate is placed on the same file system as the expired workspace. If needed, use the ''-F '' flag for this.

Linking workspaces in Home¶

It might be valuable to have links to personal workspaces within a certain directory, e.g. below the user home directory. The command

ws_register DIR

will create and manage links to all personal workspaces within the directory ''DIR''. Calling this command will do the following:

The directory ''DIR'' will be created if necessary.
Links to all personal workspaces will be managed, i.e. links to all available workspaces will be created if not already present and links to released or expired workspaces will be removed.

If you want to share workspace data with other users or groups, you can use Access Control Lists (ACLs). ACLs are a standard feature of many Linux file systems. Examples of how to use them can be found here.

$TMPDIR¶

The environment variable $TMPDIR contains the name of a directory which is located on the local SSD of each node. This means that different tasks of a parallel application use different directories when they do not utilize the same node. Although $TMPDIR points to the same path name for different nodes of a batch job, the physical location and the content of this directory path on these nodes is different.

This directory should be used for temporary files being accessed from the local node during job runtime. It should also be used if you read the same data many times from a single node, e.g. if you are doing AI training. In this case you should copy the data at the beginning of your batch job to $TMPDIR and read the data from there, see usage example below.

The $TMPDIR directory is located on an extremely fast 960 GB NVMe SSD disk. This means that performance on small files is much better than on the parallel file systems.

Each time a batch job is started, a subdirectory is created on the SSD of each node and assigned to the job. $TMPDIR is set to the name of the subdirectory and this name contains the job ID so that it is unique for each job. At the end of the job the subdirectory is removed.

On login nodes $TMPDIR also points to a fast directory on a local NVMe SSD disk but this directory is not unique. It is recommended to create your own unique subdirectory on these nodes. This directory should be used for the installation of software packages. This means that the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMPDIR. The real installation of the package (e.g. make install) should be made into the $HOME or $PROJECT folder.
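A unique subdirectory on a login node can be created with mktemp, for example. The following is a minimal sketch. The directory prefix, the package name and the install prefix are placeholders, and the fallback to /tmp only exists so that the snippet also runs on machines where $TMPDIR is not set.

```shell
# Create a unique build directory below $TMPDIR. The /tmp fallback is an
# assumption for running this snippet outside the cluster.
BUILD_DIR=$(mktemp -d "${TMPDIR:-/tmp}/build-${USER:-user}.XXXXXX")
echo "building in $BUILD_DIR"

# Unpack, configure and compile inside BUILD_DIR, then install into $HOME.
# The package name and configure options below are placeholders:
#   tar -C "$BUILD_DIR" -xzf mypackage.tar.gz
#   cd "$BUILD_DIR/mypackage" && ./configure --prefix="$HOME/sw/mypackage"
#   make && make install

# Remove the build directory when done so that it does not occupy space
# on the shared local SSD.
rm -rf "$BUILD_DIR"
```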

Attention

Local storage space is currently not managed as a resource via SLURM. Therefore, if several users or jobs use the local SSD of the same node, the desired storage space may not be available.
If you want to work with amounts of data on $TMPDIR that are close to the capacity of the local SSDs, you should allocate the compute node exclusively with the --exclusive flag.
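Because local space is not guaranteed, a job can check the free space on $TMPDIR before staging data. The following is a minimal sketch. The function name has_free_gib and the threshold of 100 GiB are arbitrary examples, and the fallback to /tmp only exists so that the snippet runs where $TMPDIR is not set.

```shell
# Hypothetical helper: succeed if at least $2 GiB are free on the file
# system that holds directory $1.
has_free_gib() {
    # df -Pk prints POSIX-format output in 1024-byte blocks; the fourth
    # column of the second line is the available space.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example: require 100 GiB (an arbitrary value) before staging data.
if has_free_gib "${TMPDIR:-/tmp}" 100; then
    echo "enough local space, staging data"
else
    echo "not enough free space on the local SSD" >&2
fi
```

The check only reflects the moment it is run. Other jobs on the same node can still fill the SSD afterwards.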

Usage example¶

We will provide an example for using $TMPDIR and describe efficient data transfer to and from $TMPDIR.

If you have a data set with many files which is frequently used by batch jobs you should create a compressed archive on a workspace. This archive can be extracted on $TMPDIR inside your batch jobs. Such an archive can be read efficiently from a parallel file system since it is a single huge file. On a login node you can create such an archive with the following steps:

## Create a workspace to store the archive
$ ws_allocate data-ssd 60
## Create the archive from a local dataset folder (example)
$ tar -cvzf $(ws_find data-ssd)/dataset.tgz dataset/
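Before relying on such an archive in many jobs, it can be worth checking that it contains all files. The following sketch builds a tiny stand-in dataset in a temporary directory (all paths and file names are invented for illustration) and compares the number of files in the source directory with the number of file entries in the archive.

```shell
# Build a tiny example dataset (stand-in for a real dataset directory).
work=$(mktemp -d)
mkdir -p "$work/dataset"
for i in 1 2 3; do echo "sample $i" > "$work/dataset/file$i.txt"; done

# Create the archive, then compare the number of regular files in the
# source directory with the number of non-directory entries in the archive
# (directory entries are listed with a trailing slash).
tar -C "$work" -czf "$work/dataset.tgz" dataset
src_files=$(find "$work/dataset" -type f | wc -l | tr -d ' ')
tar_files=$(tar -tzf "$work/dataset.tgz" | grep -vc '/$')
echo "files in source: $src_files, files in archive: $tar_files"
```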

Inside a batch job extract the archive on $TMPDIR, read input data from $TMPDIR, store results on $TMPDIR and save the results on a workspace:

#!/bin/bash
# very simple example on how to use local $TMPDIR
#SBATCH -N 1
#SBATCH -t 24:00:00

# Extract compressed input dataset on local SSD
tar -C $TMPDIR/ -xvzf $(ws_find data-ssd)/dataset.tgz

# The application reads data from dataset on $TMPDIR and writes results to $TMPDIR
myapp -input $TMPDIR/dataset/myinput.csv -outputdir $TMPDIR/results

# Before job completes save results on a workspace
rsync -av $TMPDIR/results $(ws_find data-ssd)/results-${SLURM_JOB_ID}/
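In the script above the results are only saved if the job reaches the last line. One possible variation, sketched below, is to move the copy step into a function and register it with a shell trap so that it also runs when the application fails. The function name stage_out is hypothetical and cp is used instead of rsync to keep the sketch self-contained. The trap is not guaranteed to run if the job is killed at the walltime limit.

```shell
# Hypothetical helper: copy the contents of a results directory to a
# destination directory, creating the destination if necessary.
stage_out() {
    src=$1
    dest=$2
    if [ -d "$src" ]; then
        mkdir -p "$dest" && cp -r "$src"/. "$dest"/
    fi
}

# In a batch job it could be registered right after the #SBATCH lines, e.g.:
#   trap 'stage_out "$TMPDIR/results" "$(ws_find data-ssd)/results-$SLURM_JOB_ID"' EXIT
```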

/tmp or /scratch

Do not use /tmp or /scratch; use $TMPDIR instead. An automatic cleanup of /tmp or /scratch is not possible because another job could still be using data below these directories. Hence the corresponding file systems could fill up, which can cause issues for you and for other users. $TMPDIR, on the other hand, is created when the job starts and removed when the job completes, i.e. the cleanup is done automatically.

LSDF Online Storage¶

Users of the LSDF Online Storage can access the storage on HoreKa. For this purpose the environment variables $LSDF, $LSDFPROJECTS and $LSDFHOME are set.

The LSDF Online Storage is available on all login and compute nodes at all times. If the LSDF constraint batch job parameter is used, jobs will not be started during maintenance. For details see here.

BeeOND (BeeGFS On-Demand)¶

Users of the cluster HoreKa can request a private BeeOND (BeeGFS) parallel filesystem for each job. The file system is created during job startup and purged after your job.

Attention

All data on the private BeeOND filesystem will be deleted after your job. Make sure you have copied your data back within your job to the global filesystem, e.g. $HOME, $PROJECT, any workspace or the LSDF.

BeeOND/BeeGFS can be used like any other parallel file system. Tools like cp or rsync can be used to copy data in and out.

A BeeOND file system is only created if your batch job requests this creation. For details see here.

Snapshots and backup¶

If you inadvertently deleted some of your data, or want to go back to or compare with a previous version, you can use so-called snapshots. Snapshots are point-in-time copies of your data. For the home file system there are snapshots of the last 7 days, the last 4 weeks and the last 6 months. For the workspaces there are snapshots of the last 7 days and the last 4 weeks. Snapshots of the home file system are located below /home/<group>/.snapshots. Snapshots of the work (workspace) file system are located below /hkfs/work/.snapshots.

There are also regular backups of all data in the project directories. ACLs and extended attributes are not saved by the backup. Please open a support ticket if you need us to restore data from a backup.

Quotas¶

To display your used quotas and quota limits of $HOME just execute the following command on a login node:

$ /usr/lpp/mmfs/bin/mmlsquota -u $USER --block-size G -C hkn.scc.kit.edu hkfs-home:$PROJECT_GROUP

The current quota usage and limits of your account on the workspace file system can be displayed with the command

$ /usr/lpp/mmfs/bin/mmlsquota -u $(whoami) --block-size G -C hkn.scc.kit.edu hkfs-work

File system performance tuning¶

Hints on file system performance tuning can be found here.