
Gaudi 2

The official documentation for Intel Gaudi can be found at https://docs.habana.ai/en/latest/index.html

Hardware Overview

The system management interface tool hl-smi aids in the management and monitoring of the Gaudi accelerators.

Running hl-smi without any options displays a summary table of the detected Gaudi devices:

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.1-fw-57.2.2.0          |
| Driver Version:                                     1.19.1-6f47ddd          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   27C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   26C   N/A  73W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:1a:00.0     N/A |                   0  |
| N/A   28C   N/A  75W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   28C   N/A  79W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:43:00.0     N/A |                   0  |
| N/A   28C   N/A  81W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:cc:00.0     N/A |                   0  |
| N/A   28C   N/A  73W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:44:00.0     N/A |                   0  |
| N/A   27C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:cd:00.0     N/A |                   0  |
| N/A   28C   N/A  77W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
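For scripted monitoring, the per-device rows of this summary can be turned into structured data. The following is a minimal sketch, not an official interface: the function name parse_hl_smi_summary and the dictionary keys are illustrative, and the regular expressions assume exactly the two-line-per-device layout shown above, which may differ between hl-smi versions.

```python
import re

# First row of a device entry: index, name, persistence mode, bus ID, display flag.
DEVICE_RE = re.compile(
    r"^\|\s+(\d+)\s+(\S+)\s+\S+\s+\|\s+([0-9a-fA-F:.]+)\s+\S+\s+\|"
)
# Second row: fan, temperature, perf, power usage/cap, memory usage, utilization.
STATS_RE = re.compile(
    r"^\|\s+\S+\s+(\d+)C\s+\S+\s+(\d+)W\s*/\s*(\d+)W\s+\|"
    r"\s+(\d+)MiB\s*/\s*(\d+)MiB\s+\|\s+(\d+)%"
)

# Two device entries copied from the table above, used as test input.
SAMPLE = """\
|   0  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   27C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   26C   N/A  73W /  600W  |   768MiB /  98304MiB |     0%           N/A |
"""


def parse_hl_smi_summary(text):
    """Return one dict per device found in an hl-smi summary table."""
    devices = []
    lines = text.splitlines()
    # A device entry is a header row immediately followed by a stats row.
    for header, stats in zip(lines, lines[1:]):
        d = DEVICE_RE.match(header)
        s = STATS_RE.match(stats)
        if d and s:
            temp, power, cap, used, total, util = map(int, s.groups())
            devices.append({
                "index": int(d.group(1)),
                "name": d.group(2),
                "bus_id": d.group(3),
                "temp_c": temp,
                "power_w": power,
                "power_cap_w": cap,
                "mem_used_mib": used,
                "mem_total_mib": total,
                "util_percent": util,
            })
    return devices


if __name__ == "__main__":
    for dev in parse_hl_smi_summary(SAMPLE):
        print(dev)
```

On a Gaudi node, the input text could instead be taken from the tool itself, for example with subprocess.run(["hl-smi"], capture_output=True, text=True).stdout.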

Using PyTorch

Intel provides a custom PyTorch environment that is optimized for Intel Gaudi AI accelerators. The necessary software is preinstalled on the nodes and can be loaded using the Lmod module system. To avoid ambiguities, load architecture-specific modules on the compute node itself, working inside an allocation created with salloc on the requested node:

salloc -p gaudi2 -t 01:00:00

We can then load the PyTorch module:

module purge
module load toolkit/gaudi-torch

Afterwards, a Python program can be started with:

python torch_example.py
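The contents of torch_example.py are not given here. As a rough sketch of what such a file might contain, the script below runs a small matrix multiplication on the Gaudi device. It assumes that the loaded environment provides the habana_frameworks package and that importing habana_frameworks.torch.core makes the "hpu" device available to PyTorch, as described in the Intel Gaudi documentation. The helper names select_device and main are illustrative, and the fallback to the CPU is an addition for convenience, not part of the Gaudi software.

```python
import importlib.util


def select_device() -> str:
    """Return "hpu" if the Gaudi PyTorch bridge is importable, else "cpu"."""
    if importlib.util.find_spec("habana_frameworks") is not None:
        # Assumption: this import registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core  # noqa: F401
        return "hpu"
    return "cpu"


def main() -> None:
    # Give a clear hint instead of a traceback if the module was not loaded.
    if importlib.util.find_spec("torch") is None:
        print("PyTorch not found; load the toolkit/gaudi-torch module first.")
        return

    import torch

    device = torch.device(select_device())
    x = torch.randn(256, 256, device=device)
    y = torch.randn(256, 256, device=device)
    # .item() copies the scalar back to the host, which forces the
    # computation to finish on the device.
    total = (x @ y).sum().item()
    print(f"device={device}, sum={total:.3f}")


if __name__ == "__main__":
    main()
```

While the script runs, hl-smi in a second shell on the same node should show the process under Compute Processes.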

Code examples and more information can be found here