GPU Offloading
The accelerated partition of the HoreKa cluster consists of nodes that draw their computing power mainly from GPUs. To use them, computations must be transferred from the host CPU to the GPUs, a process referred to as offloading. In recent years, four main programming approaches for this offloading have emerged. The following sections show how they can be used on HoreKa.
For the compilers to apply the appropriate optimizations, it is recommended to compile directly on the target architecture, i.e. programs with GPU offloading should usually be compiled on compute nodes equipped with GPUs. On the HoreKa cluster you can start an interactive job on the GPU development partition for this purpose:
salloc -p dev_accelerated --gres=gpu:1 -t 60
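Once the job has been allocated you get a shell on a GPU node. There you can check, for example with nvidia-smi, that the allocated GPU is actually visible:

# Inside the interactive job: list the allocated GPU
nvidia-smi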
OpenACC
OpenACC is a directive-based parallel programming model. The specification supports the C, C++ and Fortran programming languages. For more information, visit the OpenACC homepage.
The fundamental idea of an OpenACC program is to execute all compute-intensive parts on the accelerator. In doing so, data transfers between host memory and device memory should be avoided as far as possible. The host CPU can execute instructions asynchronously with the accelerator. In many cases, however, it is easier to run the computation exclusively on the accelerator and to use the CPU only for auxiliary tasks.
The following simple C program shows the typical sequence of an OpenACC program:
#include <openacc.h>
#include <stdlib.h>
#include <err.h>

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Preprocess / initialize data on the host,
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    #pragma acc enter data copyin(a[0:size])

    // Compute on the device
    #pragma acc parallel loop present(a[0:size]) wait
    for (int i = 0; i < size; i++) {
        a[i]++;
    }

    // Transfer data back from the device to the host
    #pragma acc update host(a[0:size])

    // Delete data on the device
    #pragma acc exit data delete(a[0:size])

    // Postprocess data on the host,
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);
    return 0;
}
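Note that the unstructured enter data and exit data directives keep the array a resident on the device for the entire computation, so it crosses the host-device boundary only twice: once with copyin and once with update host.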
OpenACC programs can be compiled with different compilers. The procedure for this is as follows:
- GNU Compiler Collection
# Load GNU compiler environment
module add \
  compiler/gnu
# Compile C, C++ or Fortran source code with OpenACC support
gcc ... -fopenacc ${C_SOURCE} -o ${EXECUTABLE}
g++ ... -fopenacc ${CXX_SOURCE} -o ${EXECUTABLE}
gfortran ... -fopenacc ${F90_SOURCE} -o ${EXECUTABLE}
- NVIDIA High Performance Computing (HPC) SDK
# Load NVIDIA HPC SDK environment
module add toolkit/nvidia-hpc-sdk
# Compile C, C++ or Fortran source code with OpenACC support
nvc ... -acc -target=gpu ${C_SOURCE} -o ${EXECUTABLE}
nvc++ ... -acc -target=gpu ${CXX_SOURCE} -o ${EXECUTABLE}
nvfortran ... -acc -target=gpu ${F90_SOURCE} -o ${EXECUTABLE}
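For example, assuming the program above is saved as vecinc.c (the file name is only for illustration), it can be built and run inside an interactive GPU job as follows:

# Build the OpenACC example with the NVIDIA HPC SDK
module add toolkit/nvidia-hpc-sdk
nvc -acc -target=gpu vecinc.c -o vecinc
# Run it on the allocated GPU
./vecinc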
OpenMP
OpenMP is a directive-based parallel programming model. The specification supports the C, C++ and Fortran programming languages. OpenMP originally targeted shared-memory systems and added support for accelerators such as GPUs in recent versions. For more information, visit the OpenMP homepage.
The fundamental idea of an OpenMP program is to execute all compute-intensive parts on the accelerator. In doing so, data transfers between host memory and device memory should be avoided as far as possible. The host CPU can execute instructions asynchronously with the accelerator. In many cases, however, it is easier to run the computation exclusively on the accelerator and to use the CPU only for auxiliary tasks.
The following simple C program shows the typical sequence of an OpenMP program:
#include <omp.h>
#include <stdlib.h>
#include <err.h>

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Preprocess / initialize data on the host,
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    #pragma omp target enter data map(to: a[0:size])

    // Compute on the device
    #pragma omp target teams distribute parallel for simd
    for (int i = 0; i < size; i++) {
        a[i]++;
    }

    // Transfer data back from the device to the host
    #pragma omp target update from(a[0:size])

    // Delete data on the device
    #pragma omp target exit data map(release: a[0:size])

    // Postprocess data on the host,
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);
    return 0;
}
OpenMP programs can be compiled with different compilers. The procedure for this is as follows:
- GNU Compiler Collection
# Load GNU compiler environment
module add \
  compiler/gnu
# Compile C, C++ or Fortran source code with OpenMP support
gcc ... -fopenmp ${C_SOURCE} -o ${EXECUTABLE}
g++ ... -fopenmp ${CXX_SOURCE} -o ${EXECUTABLE}
gfortran ... -fopenmp ${F90_SOURCE} -o ${EXECUTABLE}
- LLVM Compiler
# Load LLVM compiler and CUDA environment
module add \
  compiler/llvm \
  devel/cuda
# Compile C or C++ source code with OpenMP support
clang ... -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${CUDA_PATH} --offload-arch=sm_80 -L${LLVM_LIB_DIR} ${C_SOURCE} -o ${EXECUTABLE}
clang++ ... -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${CUDA_PATH} --offload-arch=sm_80 -L${LLVM_LIB_DIR} ${CXX_SOURCE} -o ${EXECUTABLE}
- NVIDIA High Performance Computing (HPC) SDK
# Load NVIDIA HPC SDK environment
module add \
  toolkit/nvidia-hpc-sdk
# Compile C, C++ or Fortran source code with OpenMP support
nvc ... -mp -target=gpu ${C_SOURCE} -o ${EXECUTABLE}
nvc++ ... -mp -target=gpu ${CXX_SOURCE} -o ${EXECUTABLE}
nvfortran ... -mp -target=gpu ${F90_SOURCE} -o ${EXECUTABLE}
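Whether the target regions actually run on the GPU or silently fall back to host execution can be checked with the standard OpenMP environment variable OMP_TARGET_OFFLOAD; setting it to MANDATORY makes the program abort if offloading is not possible:

# Abort instead of silently falling back to the host
export OMP_TARGET_OFFLOAD=MANDATORY
./my_openmp_program    # executable name is only for illustration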
CUDA
CUDA is a collection of accelerated libraries and extensions to the C, C++ and Fortran programming languages. For more information, visit the CUDA homepage.
The following simple CUDA C program shows the typical sequence of a CUDA program:
#include <cuda_runtime.h>
#include <stdlib.h>
#include <err.h>

// Kernel: each thread increments one array element
__global__
void inc_kernel(double *a) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    a[i]++;
}

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Allocate memory on the device
    double *d_a;
    if (cudaMalloc(&d_a, size * sizeof(double)) != cudaSuccess) {
        errx(1, "cudaMalloc d_a[] failed");
    }

    // Preprocess / initialize data on the host,
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    cudaMemcpy(d_a, a, size * sizeof(double), cudaMemcpyHostToDevice);

    // Compute on the device: size/256 blocks of 256 threads each
    inc_kernel<<<size/256, 256>>>(d_a);

    // Transfer data back from the device to the host
    cudaMemcpy(a, d_a, size * sizeof(double), cudaMemcpyDeviceToHost);

    // Delete data on the device
    cudaFree(d_a);

    // Postprocess data on the host,
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);
    return 0;
}
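Note that a kernel launch does not return an error code itself; launch failures only become visible through cudaGetLastError() or a later synchronizing runtime call. A minimal sketch of how the launch in the example above could be checked:

// Sketch: launch the kernel, then check for launch and execution errors
inc_kernel<<<size/256, 256>>>(d_a);
if (cudaGetLastError() != cudaSuccess) {
    errx(1, "inc_kernel launch failed");
}
if (cudaDeviceSynchronize() != cudaSuccess) {
    errx(1, "inc_kernel execution failed");
}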
CUDA programs can be compiled with different compilers. The procedure for this is as follows:
- LLVM Compiler
# Load LLVM compiler and CUDA environment
module add \
  compiler/llvm \
  devel/cuda
# Compile C or C++ source code with CUDA support
clang ... --cuda-gpu-arch=sm_80 ${C_SOURCE} -o ${EXECUTABLE} -lcudart
clang++ ... --cuda-gpu-arch=sm_80 ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
- NVIDIA High Performance Computing (HPC) SDK
# Load NVIDIA HPC SDK environment
module add \
  toolkit/nvidia-hpc-sdk
# Compile C or C++ source code with CUDA support
nvcc ... ${C_SOURCE} -o ${EXECUTABLE}
nvc++ ... ${CXX_SOURCE} -o ${EXECUTABLE}
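The clang command lines above target compute capability 8.0 (sm_80, the architecture of the NVIDIA A100 GPUs in the accelerated partition). With nvcc the corresponding option is -arch, for example:

# Target compute capability 8.0 explicitly
nvcc -arch=sm_80 ${C_SOURCE} -o ${EXECUTABLE}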
OpenCL
OpenCL is a low-level programming framework that gives direct control over where code is executed. Accelerator code, so-called kernels, is usually written in OpenCL C, which is based on C99. Host code is typically written in C or C++.
For more information, visit the Khronos OpenCL homepage. The following simple C program shows the typical sequence of an OpenCL program:
#include <CL/opencl.h>
#include <stdlib.h>
#include <err.h>

// OpenCL C kernel source: each work-item increments one array element
const char *incSource =
    "__kernel void inc(                 \n"
    "    __global double* a_d,          \n"
    "    const unsigned int size) {     \n"
    "    int i = get_global_id(0);      \n"
    "    if (i < size)                  \n"
    "        a_d[i] = a_d[i] + 1.0;     \n"
    "}                                  \n";

int main(int argc, char *argv[]) {
    const unsigned int size = 1 << 16;
    cl_int errCode;

    // Obtain the first available platform
    cl_platform_id platformID = NULL;
    cl_uint numPlatforms;
    errCode = clGetPlatformIDs(1, &platformID, &numPlatforms);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetPlatformIDs() failed");
    }

    // Obtain the first available device on the platform
    cl_device_id deviceID = NULL;
    cl_uint numDevices;
    errCode = clGetDeviceIDs(platformID, CL_DEVICE_TYPE_DEFAULT, 1,
                             &deviceID, &numDevices);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetDeviceIDs() failed");
    }

    // Create an OpenCL context
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateContext() failed");
    }

    // Create a command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateCommandQueue() failed");
    }

    // Allocate memory on the host
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Allocate memory on the device
    cl_mem a_d = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(double), NULL, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateBuffer() failed");
    }

    // Preprocess / initialize data on the host,
    // e.g. read data from storage
    for (int i = 0; i < size; i++) {
        a[i] = 0.;
    }

    // Copy data from the host to the device
    errCode = clEnqueueWriteBuffer(commandQueue, a_d, CL_TRUE, 0, size * sizeof(double), a, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueWriteBuffer() failed");
    }

    //
    // Compute on the device
    //

    // Create a program object for the context and load the kernel source into it
    cl_program program = clCreateProgramWithSource(context, 1, &incSource, NULL, &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateProgramWithSource() failed");
    }

    // Build (compile and link) a program executable from the program source
    errCode = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        size_t len;
        char buffer[2048];
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
        errx(1, "clBuildProgram() failed:\n%s", buffer);
    }

    // Create a kernel object
    cl_kernel kernel = clCreateKernel(program, "inc", &errCode);
    if (errCode != CL_SUCCESS) {
        errx(1, "clCreateKernel() failed");
    }

    // Set the argument values of the kernel
    errCode = clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_d);
    if (errCode != CL_SUCCESS) {
        errx(1, "clSetKernelArg() failed");
    }
    errCode = clSetKernelArg(kernel, 1, sizeof(unsigned int), &size);
    if (errCode != CL_SUCCESS) {
        errx(1, "clSetKernelArg() failed");
    }

    // Query the maximum work-group size
    size_t local;
    errCode = clGetKernelWorkGroupInfo(kernel, deviceID, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clGetKernelWorkGroupInfo() failed");
    }

    // Enqueue a command to execute the kernel on the device
    size_t global = size;
    errCode = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueNDRangeKernel() failed");
    }

    // Wait for command completion
    errCode = clFinish(commandQueue);
    if (errCode != CL_SUCCESS) {
        errx(1, "clFinish() failed");
    }

    // Release the kernel object
    errCode = clReleaseKernel(kernel);
    // Release the program object
    errCode = clReleaseProgram(program);
    // Release the device
    errCode = clReleaseDevice(deviceID);

    // Transfer data back from the device to the host
    errCode = clEnqueueReadBuffer(commandQueue, a_d, CL_TRUE, 0, size * sizeof(double), a, 0, NULL, NULL);
    if (errCode != CL_SUCCESS) {
        errx(1, "clEnqueueReadBuffer() failed");
    }

    // Delete data on the device
    errCode = clReleaseMemObject(a_d);
    if (errCode != CL_SUCCESS) {
        errx(1, "clReleaseMemObject() failed");
    }

    // Release the command queue
    errCode = clReleaseCommandQueue(commandQueue);
    // Release the context
    errCode = clReleaseContext(context);

    // Postprocess data on the host,
    // e.g. write data to storage
    for (int i = 0; i < size; i++) {
        if (a[i] != 1.) {
            errx(2, "Computation on GPU failed");
        }
    }

    // Free memory on the host
    free(a);
    return 0;
}
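Note that clCreateCommandQueue() has been deprecated since OpenCL 2.0. On platforms that provide the newer API, the queue in the example above could instead be created like this:

// OpenCL 2.0+: create the command queue via the non-deprecated API
cl_command_queue commandQueue =
    clCreateCommandQueueWithProperties(context, deviceID, NULL, &errCode);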
OpenCL programs can be compiled with different compilers. The procedure for this is as follows:
- GNU Compiler Collection
# Load GNU compiler and CUDA environment
module add \
  compiler/gnu \
  devel/cuda
# Compile C or C++ source code with OpenCL support
gcc ... ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
g++ ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
- Intel Compiler
# Load Intel compiler and CUDA environment
module add \
  compiler/intel \
  devel/cuda
# Compile C or C++ source code with OpenCL support
icc ... ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
icpc ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
- LLVM Compiler
# Load LLVM compiler and CUDA environment
module add \
  compiler/llvm \
  devel/cuda
# Compile C or C++ source code with OpenCL support
clang ... ${C_SOURCE} -o ${EXECUTABLE} -lOpenCL
clang++ ... ${CXX_SOURCE} -o ${EXECUTABLE} -lOpenCL
GPU Offloading and MPI
MPI is a standardized and portable message-passing library for distributed processes. Both point-to-point and collective communication are supported. All communication takes place between the private memory spaces of the distributed processes, so data residing on the GPU must normally be copied into this private host memory before communication. With so-called CUDA-aware MPI implementations this extra copy is not required; transfers can be performed directly from GPU memory.
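Whether the MPI implementation in use is CUDA-aware can be queried at run time. With Open MPI this is exposed through the extension header mpi-ext.h; a minimal sketch, assuming an Open MPI based installation:

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h> /* Open MPI extension header */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    // 1 means buffers in GPU memory may be passed to MPI calls directly
    printf("Run-time CUDA-aware support: %d\n", MPIX_Query_cuda_support());
#else
    printf("This MPI build does not advertise CUDA-aware support.\n");
#endif
    MPI_Finalize();
    return 0;
}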
The following simple MPI program demonstrates host-to-host, host-to-device, device-to-device and device-to-host point-to-point communication:
#include <mpi.h>
#include <stdlib.h>
#include <err.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    // Initialize the MPI execution environment
    MPI_Init(&argc, &argv);

    // Get the size of the group associated with communicator MPI_COMM_WORLD
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the calling process in the communicator MPI_COMM_WORLD
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Allocate memory on the host
    int size = 1000;
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }

    // Allocate memory on the device
    double *d_a;
    if (cudaMalloc((void **) &d_a, size * sizeof(double)) != cudaSuccess) {
        errx(1, "cudaMalloc d_a[] failed");
    }

    // Initialize host memory
    for (int i = 0; i < size; i++) {
        a[i] = (double) world_rank;
    }

    MPI_Status status;

    // Host to device: from rank 0 (CPU) to rank 1 (GPU)
    if (world_rank == 0) {
        MPI_Send(a, size, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        MPI_Recv(d_a, size, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
    }

    // Device to device: from rank 1 (GPU) to rank 0 (GPU)
    if (world_rank == 1) {
        MPI_Send(d_a, size, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    } else if (world_rank == 0) {
        MPI_Recv(d_a, size, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &status);
    }

    // Device to host: from rank 0 (GPU) to rank 1 (CPU)
    if (world_rank == 0) {
        MPI_Send(d_a, size, MPI_DOUBLE, 1, 3, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        MPI_Recv(a, size, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD, &status);
    }

    // Host to host: from rank 1 (CPU) to rank 0 (CPU)
    if (world_rank == 1) {
        MPI_Send(a, size, MPI_DOUBLE, 0, 4, MPI_COMM_WORLD);
    } else if (world_rank == 0) {
        MPI_Recv(a, size, MPI_DOUBLE, 1, 4, MPI_COMM_WORLD, &status);
    }

    // Check host memory: rank 0's zeros have travelled through all four paths
    for (int i = 0; i < size; i++) {
        if (a[i] != 0.) {
            errx(2, "MPI transport failed");
        }
    }

    // Terminate the MPI execution environment
    MPI_Finalize();

    // Delete data on the device
    cudaFree(d_a);

    // Free memory on the host
    free(a);
    return 0;
}
MPI programs can be compiled with different compilers. The procedure for this is as follows:
- GNU Compiler Collection, OpenMPI
# Load GNU compiler, OpenMPI and CUDA environment
module add \
  compiler/gnu \
  mpi/openmpi \
  devel/cuda
# Compile C or C++ source code with MPI and CUDA support
mpicc ... ${C_SOURCE} -o ${EXECUTABLE} -lcudart
mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
- LLVM Compiler, OpenMPI
# Load LLVM compiler, OpenMPI and CUDA environment
module add \
  compiler/llvm \
  mpi/openmpi \
  devel/cuda
# Compile C or C++ source code with MPI and CUDA support
mpicc ... ${C_SOURCE} -o ${EXECUTABLE} -lcudart
mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
- NVIDIA High Performance Computing (HPC) SDK
# Load NVIDIA HPC SDK environment
module add toolkit/nvidia-hpc-sdk
# Compile C or C++ source code with MPI and CUDA support
mpicc ... ${C_SOURCE} -o ${EXECUTABLE} -lcudart
mpicxx ... ${CXX_SOURCE} -o ${EXECUTABLE} -lcudart
MPI programs with accelerator support are started as usual with mpirun. On the accelerated partition of the HoreKa cluster, each node provides 2 CPUs with 38 cores each and 4 GPUs. To distribute the CPU cores evenly across the GPUs, you can proceed as follows:
mpirun \
--display-map \
--display-allocation \
--map-by ppr:2:socket:pe=19 \
--bind-to core \
bash -c \
'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK};
${EXECUTABLE}'
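The mapping ppr:2:socket:pe=19 places two ranks on each of the two sockets and assigns 19 of the 38 cores per socket to each rank, i.e. four ranks per node. The exported CUDA_VISIBLE_DEVICES then restricts each local rank to exactly one of the node's four GPUs.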