JobMon - Job Performance Metrics¶
With JobMon we offer a web-based performance service. It visualizes selected performance metrics collected on the cluster nodes during the runtime of a user's job. Metrics are stored for at least 4 weeks, so that performance changes over time can be tracked.
Benchmarks¶
In this documentation we demonstrate JobMon use cases using benchmarks from three different categories:
Compute bound¶
Performance is limited by the speed of the CPU; this typically means it is limited by the rate of floating point operations the CPU can execute. Benchmarks from this category include:
- DGEMM
  The DGEMM benchmark performs a matrix-matrix multiplication C = A · B. For n-dimensional square matrices, the computation requires O(n³) compute operations but only O(n²) memory operations. Implementations for CPU and GPU are available for this benchmark.
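This ratio of O(n³) compute operations to only O(n²) memory operations is what makes DGEMM compute bound. As a minimal sketch (using NumPy, not the benchmark implementation run on the cluster), the achieved FLOP rate and the arithmetic intensity can be estimated like this:

```python
import time
import numpy as np

n = 4096                          # dimension of the square matrices
A = np.random.rand(n, n)
B = np.random.rand(n, n)

start = time.perf_counter()
C = A @ B                         # DGEMM: C = A * B
elapsed = time.perf_counter() - start

flops = 2.0 * n**3                # ~2 n^3 floating point operations
bytes_moved = 3 * n * n * 8       # read A and B, write C (double precision)
print(f"{flops / elapsed / 1e9:.1f} GFLOP/s, "
      f"arithmetic intensity ~{flops / bytes_moved:.0f} FLOP/byte")
```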
Memory bound¶
Performance is limited by the speed of the main memory subsystem. This can be caused by memory bandwidth saturation, or memory access latency. Benchmarks from this category include:
- Stream
  The Stream benchmark performs four sub-benchmarks: vector copy c = a, vector scale b = α · c, vector add c = a + b and vector triad a = b + α · c. For n-dimensional vectors, the computation requires O(n) memory operations and only O(n) compute operations, so the arithmetic intensity is low and performance is limited by memory bandwidth. Implementations for CPU and GPU are available for this benchmark (see the sketch after this list).
- High Performance Conjugate Gradients (HPCG)
  HPCG performs the conjugate gradient method with a sparse matrix. Implementations for CPU and GPU are available for this benchmark.
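For comparison, a minimal NumPy sketch of the Stream triad (again not the benchmark implementation itself) shows why such kernels are memory bound: the number of memory operations grows as fast as the number of compute operations, so the arithmetic intensity stays far below one FLOP per byte.

```python
import time
import numpy as np

n = 50_000_000                    # vector length
alpha = 3.0
b = np.random.rand(n)
c = np.random.rand(n)

start = time.perf_counter()
a = b + alpha * c                 # Stream triad: a = b + alpha * c
elapsed = time.perf_counter() - start

bytes_moved = 3 * n * 8           # read b and c, write a (double precision)
flops = 2 * n                     # one multiply and one add per element
print(f"{bytes_moved / elapsed / 1e9:.1f} GB/s, "
      f"arithmetic intensity ~{flops / bytes_moved:.2f} FLOP/byte")
```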
Communication bound¶
Performance is limited by the node-to-node communication network. This can be caused by interconnect bandwidth saturation or communication packet latency. Benchmarks from this category include:
- OSU Micro-Benchmarks (OMB)
  OMB performs point-to-point communication with messages of increasing size to measure bandwidth and latency. Implementations for CPU and GPU are available for this benchmark.
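A minimal point-to-point ping-pong measurement in the spirit of OMB can be sketched with mpi4py (assuming an MPI installation; this is not the OMB code itself):

```python
# Run with: mpirun -np 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iterations = 100

for size in (1 << 10, 1 << 16, 1 << 22):   # message sizes in bytes
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(iterations):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = MPI.Wtime() - start
    if rank == 0:
        # each iteration transfers 2 * size bytes (one round trip)
        bandwidth = 2 * size * iterations / elapsed / 1e9
        print(f"{size:>8} B  {bandwidth:6.2f} GB/s")
```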
Jobs page¶
After logging in to the JobMon service, you will be automatically redirected to the Jobs page.
This page presents an overview of the individual batch jobs that have been executed on the HoreKa cluster. These jobs can be filtered by:
- the partition the job was running on
- the number of nodes or number of GPUs used by the job
- whether the job is still running or has already finished
- the time the job was running (e.g. jobs of the last week)
- tags assigned to the job (e.g. tags: "without optimization", "optimization A", "optimization B", ...)
The Jobs page includes a so-called spider plot for each job. This plot shows at a glance what limits the performance of the job and allows it to be categorized as memory bound, compute bound, or communication bound (a plotting sketch follows after the examples below). For this purpose, the plot shows the average and maximum values of the following metrics:
- CPU floating point operations per second
- CPU memory bandwidth
- GPU utilization
- GPU memory utilization
- InfiniBand bandwidth
Examples: Spider Plot
DGEMM is capable of using the entire available floating point performance of the node. It also uses a significant portion of the memory bandwidth.
Stream saturates the entire memory bandwidth. All other resources are under-utilized.
OSU Micro-Benchmarks saturates the entire InfiniBand bandwidth. Some memory bandwidth utilization can also be seen as messages are transferred from memory to memory.
The DGEMM implementation uses only a single GPU. Therefore, only a quarter of the available floating point performance of the GPUs is utilized. This is reflected in the use of memory bandwidth as well.
HPCG runs distributed on all GPUs of the node. HPCG saturates the entire memory bandwidth of these GPUs. The compute units are also well utilized.
InfiniBand communication between GPUs is not as performant as between CPUs. Therefore, OSU Micro-Benchmarks cannot utilize the entire available InfiniBand bandwidth.
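Such a spider plot can also be reproduced offline from downloaded metrics, e.g. with matplotlib. The following sketch uses hypothetical, normalized values (0 to 1, relative to the node maximum) for the five axes listed above, roughly resembling a CPU DGEMM run:

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["CPU FLOP/s", "CPU memory bandwidth", "GPU utilization",
           "GPU memory utilization", "InfiniBand bandwidth"]
# hypothetical values, normalized to the respective node maximum
average = [0.85, 0.40, 0.05, 0.05, 0.02]
maximum = [0.95, 0.55, 0.05, 0.05, 0.03]

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]                         # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for values, label in ((average, "average"), (maximum, "maximum")):
    data = values + values[:1]
    ax.plot(angles, data, label=label)
    ax.fill(angles, data, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.legend()
plt.show()
```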
Per job page¶
Configuration options¶
- Select subset of nodes: Show only nodes of interest
- Select time range: Zoom into time ranges of interest
- Select subset of metrics: Focus on metrics of interest
- Set Tag
  Tags are a handy tool to mark jobs during different optimization steps (e.g. tags: "without optimization", "optimization A", "optimization B", ...) or jobs from different workflows (e.g. tags: "workflow A", "workflow B"). Filtering for tags on the Jobs page makes it easy to find all jobs with the same tag.
- Toggle for Automatic Scaling
  Select the upper and lower limits of the diagram's y-axis depending on the measured metric values
- Toggle for Changepoints
  Change point detection tries to identify points in time where the behavior of a performance metric changes. If changes are identified, they are displayed as vertical lines in the diagram (a simple detection sketch follows after the examples below).
Examples: Changepoints
OMB performs communication with messages of increasing size. Without change point markers, the performance changes caused by the increasing message size can only be recognized as steps in the graph.
OMB performs communication with messages of increasing size. Some of the points in time where the performance changes due to the increasing message size are detected by the change point algorithm and marked with vertical lines.
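The change point algorithm used by JobMon is not described here. As an illustration only, even a simple sliding-window comparison of means (a naive sketch, not JobMon's implementation) finds the level shifts in such a staircase-shaped bandwidth graph:

```python
import numpy as np

def changepoints(values, window=10, threshold=2.0):
    """Report indices where the mean of the next `window` samples differs
    from the mean of the previous `window` samples by more than
    `threshold` times the local standard deviation."""
    points = []
    for i in range(window, len(values) - window):
        left = values[i - window:i]
        right = values[i:i + window]
        scale = np.std(left) + 1e-9
        if abs(np.mean(right) - np.mean(left)) > threshold * scale:
            points.append(i)
    return points

# synthetic bandwidth signal with two level shifts (plus noise)
signal = np.concatenate([np.full(50, 1.0), np.full(50, 4.0), np.full(50, 9.0)])
signal += np.random.normal(0.0, 0.1, signal.size)
print(changepoints(signal))   # clusters of indices around 50 and 100
```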
- Toggle for Quantile view
  - Quantile view is a useful feature to improve clarity when too many graphs are displayed in one diagram (a small computation sketch follows after the examples below)
  - The information is condensed into only three graphs (25% / 50% / 75% quantiles = quartiles)
  - The 25% quantile is the graph below which 25% of the measured values lie
  - The 50% quantile is also known as the median
  - The difference between the upper and lower quantile is a measure of the spread of the metric values
Examples: Quantile View
The CPU temperatures are collected per hardware thread; therefore the diagram appears very cluttered.
The Quantile View shows the distribution of the CPU temperatures much more clearly.
The CPU frequencies are collected per CPU core; therefore the diagram appears very cluttered.
The Quantile View shows the distribution of the CPU frequencies much more clearly.
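The three quantile graphs can also be computed directly from downloaded data, e.g. with NumPy. The sketch below uses a hypothetical array with one row per time step and one column per CPU core:

```python
import numpy as np

# hypothetical temperature samples: 100 time steps x 76 CPU cores
samples = np.random.normal(loc=55.0, scale=3.0, size=(100, 76))

# 25% / 50% / 75% quantiles across the cores for every time step
q25, median, q75 = np.percentile(samples, [25, 50, 75], axis=1)

# the interquartile range (q75 - q25) measures the spread across cores
iqr = q75 - q25
print(median[:5], iqr[:5])
```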
Performance categories¶
Metrics are grouped into different performance categories.
Energy¶
This category offers diagrams for:
- CPU power consumption of the package and the DRAM channels
- GPU power consumption
- Server system power consumption
Examples: Category Energy
Stream, as a memory-bound benchmark, puts constantly high pressure on the DRAM subsystem. This is reflected in the constantly high power consumption of this subsystem.
DGEMM, as a compute-bound benchmark, puts less pressure on the DRAM subsystem. This is reflected in the varying power consumption of this subsystem over time.
The GPU implementation of the HPCG benchmark has a preparation phase executed on the CPU and a computation phase executed on the GPU. During the first phase the power consumption of the CPU package is therefore higher than in the subsequent phase.
The GPU implementation of the HPCG benchmark has a preparation phase executed on the CPU and a computation phase executed on the GPU. During the first phase the power consumption of the GPUs is therefore lower than in the subsequent phase.
Filesystem¶
This category offers diagrams for e.g. metadata operations, I/O throughput, ...
Interconnect¶
This category offers diagrams for:
- InfiniBand: Send, receive and aggregated bandwidth
- InfiniBand: Number of sent, received and aggregated packets
Examples: Category Interconnect
InfiniBand send bandwidth for OSU Micro-Benchmarks: in the point-to-point communication, node hkn0201 sends data while node hkn0202 only receives data (its send bandwidth is zero).
InfiniBand receive bandwidth for OSU Micro-Benchmarks: in the point-to-point communication, node hkn0202 receives data while node hkn0201 only sends data (its receive bandwidth is zero).
Aggregated InfiniBand bandwidth for OSU Micro-Benchmarks: in the point-to-point communication, both nodes show the same bandwidth, even though one of the nodes only sends data and the other one only receives data.
Memory¶
This category offers diagrams for:
- Amount of memory used on the system (CPU) and on the GPU
- CPU memory bandwidth
- GPU memory utilization (in %) and frequency
Examples: Category Memory
Stream, as a memory-bound benchmark, puts constantly high pressure on the memory subsystem.
DGEMM, as a compute-bound benchmark, puts less pressure on the memory subsystem. This is reflected in the varying bandwidth over time.
BabelStream, as a memory-bound benchmark, fully utilizes the memory subsystem of the GPU.
The GPU implementation of DGEMM, as a compute-bound benchmark, puts less pressure on the GPU memory subsystem. This is reflected in the varying utilization over time.
Performance¶
This category offers diagrams for:
- Floating point operations per second (FLOP/s), collected per hardware thread and aggregated per core or per socket
- Instructions per cycle (IPC), collected per hardware thread and aggregated per core or per socket
- CPU time spent in kernel and in user space
- One-minute Linux load average
- GPU utilization
- CPU and GPU frequency
Examples: Category Performance
Floating point operations per second (FLOP/s) are collected per hardware thread. As the hardware threads of a core share the same compute units, it may be more appropriate to aggregate per core.
Floating point operations per second (FLOP/s) are collected per hardware thread. To check whether the CPU sockets are evenly utilized, an aggregation per socket can be helpful.
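The same aggregation can be done offline, e.g. with pandas. The column names in the following sketch are hypothetical and do not necessarily match the CSV export:

```python
import pandas as pd

# hypothetical per-hardware-thread FLOP/s samples (column names are made up)
df = pd.DataFrame({
    "socket": [0, 0, 0, 0, 1, 1, 1, 1],
    "core":   [0, 0, 1, 1, 2, 2, 3, 3],
    "thread": [0, 1, 0, 1, 0, 1, 0, 1],
    "flops":  [1.2e9, 0.8e9, 1.1e9, 0.9e9, 2.0e9, 0.1e9, 1.5e9, 0.5e9],
})

# hardware threads of a core share its compute units -> sum per core
per_core = df.groupby(["socket", "core"])["flops"].sum()

# check whether both sockets are evenly utilized -> sum per socket
per_socket = df.groupby("socket")["flops"].sum()

print(per_core)
print(per_socket)
```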
Temperature¶
This category offers diagrams for:
- CPU and GPU temperature
Examples: Category Temperature
The GPU implementation of the DGEMM benchmark only utilizes one of the GPUs. Only this GPU gets hot, while the other GPUs maintain a lower temperature.
The GPU implementation of the HPCG benchmark has a preparation phase executed on the CPU and a computation phase executed on the GPU. During the first phase, the GPU is not utilized and therefore stays cooler than in the subsequent phase.
Additional Features¶
- For multi-node jobs there is a configuration option to select the per-node aggregation function used (e.g. average, sum, maximum)
- Live view of still running jobs
- Download of metrics as CSV file
  - All metrics as measured by the collector
  - Ready for use e.g. in a spreadsheet application or Python (see the sketch at the end of this page)
- Outlook
  - For future versions of JobMon, it is planned to automatically analyze each job and assign tags for detected characteristics
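A downloaded CSV file can be post-processed directly in Python, for example with pandas. The file name and column names below are placeholders; the actual header depends on the metrics selected in JobMon:

```python
import pandas as pd

# "job_metrics.csv" and the column names are placeholders for the
# actual JobMon export of the selected job
df = pd.read_csv("job_metrics.csv", parse_dates=["time"])

# e.g. average and maximum memory bandwidth per node over the job runtime
summary = df.groupby("node")["cpu_mem_bandwidth"].agg(["mean", "max"])
print(summary)
```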