Gaudi 2
The official documentation for Intel Gaudi can be found at https://docs.habana.ai/en/latest/index.html
Hardware Overview
The system management interface tool hl-smi helps manage and monitor the Gaudi accelerators. Running hl-smi without any options displays a summary table of the detected Gaudi devices:
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.19.1-fw-57.2.2.0 |
| Driver Version: 1.19.1-6f47ddd |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:19:00.0 N/A | 0 |
| N/A 27C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 26C N/A 73W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:1a:00.0 N/A | 0 |
| N/A 28C N/A 75W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 28C N/A 79W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:43:00.0 N/A | 0 |
| N/A 28C N/A 81W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:cc:00.0 N/A | 0 |
| N/A 28C N/A 73W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:44:00.0 N/A | 0 |
| N/A 27C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:cd:00.0 N/A | 0 |
| N/A 28C N/A 77W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
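For scripted monitoring, the per-device rows of this summary can be turned into structured records. The following is a minimal sketch, not an official interface: the function name `parse_hl_smi`, the dictionary keys, and the regular expressions are assumptions based only on the table layout shown above, and they may need adjusting if a different hl-smi version formats its output differently.

```python
import re

# Two rows per device, copied from the summary table above.
SAMPLE = """\
| 0 HL-225 N/A | 0000:19:00.0 N/A | 0 |
| N/A 27C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
| 1 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 26C N/A 73W / 600W | 768MiB / 98304MiB | 0% N/A |
"""

# First row of a device: index, model name, PCI bus id.
ID_ROW = re.compile(r"\|\s*(\d+)\s+(HL-\d+)\s.*?\|\s*([0-9a-fA-F:.]+)\s")
# Second row of a device: temperature, power usage/cap, memory usage/total, utilization.
STAT_ROW = re.compile(
    r"(\d+)C\s.*?(\d+)W\s*/\s*(\d+)W\s*\|\s*(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%"
)


def parse_hl_smi(text):
    """Return one dict per device found in hl-smi summary output."""
    devices = []
    for line in text.splitlines():
        id_match = ID_ROW.match(line)
        if id_match:
            devices.append(
                {
                    "index": int(id_match[1]),
                    "name": id_match[2],
                    "bus_id": id_match[3],
                }
            )
            continue
        stat_match = STAT_ROW.search(line)
        if stat_match and devices:
            temp, power, cap, used, total, util = map(int, stat_match.groups())
            # Attach the statistics to the device whose id row came just before.
            devices[-1].update(
                temp_c=temp,
                power_w=power,
                power_cap_w=cap,
                mem_used_mib=used,
                mem_total_mib=total,
                util_pct=util,
            )
    return devices


if __name__ == "__main__":
    for device in parse_hl_smi(SAMPLE):
        print(device)
```

On a Gaudi node, the sample text could be replaced with live output, for example `subprocess.run(["hl-smi"], capture_output=True, text=True).stdout`. Lines that match neither pattern, such as the header and the process list, are ignored.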
Using PyTorch
Intel provides a custom PyTorch environment that is optimized for Intel Gaudi AI accelerators. The necessary software is preinstalled on the nodes and can be loaded using the Lmod module system. To avoid ambiguities, load architecture-specific modules on the compute nodes themselves, working in an allocation created with salloc on the requested node:
salloc -p gaudi2 -t 01:00:00
We can then load the PyTorch module:
module purge
module load toolkit/gaudi-torch
Afterwards, a Python program can be started with:
python torch_example.py
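The contents of torch_example.py are not shown on this page. As an illustration only, a minimal script might look like the sketch below. It assumes the loaded module provides the Intel Gaudi PyTorch bridge package `habana_frameworks`, which registers the `hpu` device; the helper name `select_device`, the CPU fallback, and the matrix sizes are hypothetical choices made for this example.

```python
"""torch_example.py: hypothetical minimal example, not an official script."""
import importlib.util


def select_device() -> str:
    """Return "hpu" if the Gaudi PyTorch bridge is importable, otherwise "cpu"."""
    if importlib.util.find_spec("habana_frameworks") is None:
        return "cpu"
    # Importing the bridge registers the "hpu" device type with PyTorch.
    import habana_frameworks.torch.core  # noqa: F401

    return "hpu"


def main() -> None:
    if importlib.util.find_spec("torch") is None:
        print("PyTorch not found; was the toolkit/gaudi-torch module loaded?")
        return
    import torch

    device = torch.device(select_device())
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    # .item() copies the scalar to the host, which forces the computation to finish.
    checksum = (a @ b).sum().item()
    print(f"device={device} checksum={checksum:.3f}")


if __name__ == "__main__":
    main()
```

If the script reports `device=cpu` on a Gaudi node, the bridge package was not importable, which usually suggests the module was not loaded in the current shell.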
Code examples and more information can be found here