Unlocking GPU-Accelerated RDMA with NVIDIA DOCA GPUNetIO

NVIDIA DOCA GPUNetIO is a library within the NVIDIA DOCA SDK, specifically designed for real-time inline GPU packet processing. It combines technologies like GPUDirect RDMA and GPUDirect Async to enable GPU-centric applications in which a CUDA kernel communicates directly with the network interface card (NIC) to send and receive packets, removing the CPU from the critical path.

The core principles and uses of DOCA GPUNetIO have been discussed in the previous posts, Inline GPU Packet Processing with NVIDIA DOCA GPUNetIO and Realizing the Power of Real-Time Network Processing with NVIDIA DOCA GPUNetIO, as well as in the DOCA GPUNetIO programming guide.

Previously, DOCA GPUNetIO, alongside DOCA Ethernet and DOCA Flow, was limited to handling packet transmissions over the Ethernet transport layer. With the introduction of DOCA 2.7, an expanded set of APIs now enables DOCA GPUNetIO to support RDMA communications directly from a GPU CUDA kernel using RoCE or InfiniBand transport layers.

This post explores the new Remote Direct Memory Access (RDMA) functionalities controlled by a GPU CUDA kernel with DOCA GPUNetIO and presents a performance comparison with the performance test (perftest) microbenchmarks.

Note that the RDMA acronym describes the protocol that enables remote direct memory access from the memory of one computer into that of another without involving the operating system of either one. Examples of operations include RDMA Write and RDMA Read. It must not be confused with GPUDirect RDMA, which is not related to the RDMA protocol. GPUDirect RDMA is one of the technologies in the NVIDIA GPUDirect family: it enables the network card to send or receive data by directly accessing GPU memory, bypassing CPU memory copies and operating system routines. GPUDirect RDMA can be enabled by any network framework working over Ethernet, InfiniBand, or RoCE.

RDMA GPU data path with GPUNetIO

RDMA provides direct access between the main memory of two hosts without involving the operating system, cache, or storage. This enables data transfers with high throughput, low latency, and low CPU utilization. This is accomplished by registering and sharing a local memory area with the remote host (or peer) so the remote host knows how to access it.

An application where two peers need to exchange data over RDMA typically follows three fundamental steps:

  • Step 1 – Local configuration: Each peer locally creates the RDMA queue(s) and the memory buffers to share with the other peer.
  • Step 2 – Exchange info: Using an out-of-band (OOB) mechanism (for example, Linux sockets), the peers exchange info about the RDMA queue(s) and memory buffers to be accessed remotely (a minimal socket-based sketch of this step follows the list).
  • Step 3 – Data path: The two peers execute RDMA Read/Write/Send/Recv operations to exchange data using the remote memory addresses.
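To make Step 2 concrete, the following is a minimal sketch of an OOB exchange over an already-connected TCP socket. The rdma_conn_desc layout and the oob_exchange helper are hypothetical, not part of any library API; a real application exchanges whatever connection descriptor its RDMA library exports, and would also handle partial reads and writes.

/* Minimal sketch of the out-of-band (OOB) exchange in Step 2.
 * The rdma_conn_desc layout is hypothetical: a real application exchanges
 * whatever descriptor its RDMA library exports for the connection. */
#include <stdint.h>
#include <unistd.h>

struct rdma_conn_desc {
    uint64_t remote_addr;  /* start of the memory region shared with the peer */
    uint32_t rkey;         /* remote access key for that memory region */
    uint8_t  qp_info[64];  /* opaque queue connection details (size is an assumption) */
};

/* Exchange descriptors over an already-connected TCP socket:
 * send the local descriptor, then receive the peer's descriptor.
 * Partial reads/writes are ignored here to keep the sketch short. */
static int oob_exchange(int sock_fd,
                        const struct rdma_conn_desc *local,
                        struct rdma_conn_desc *remote)
{
    if (write(sock_fd, local, sizeof(*local)) != (ssize_t)sizeof(*local))
        return -1;
    if (read(sock_fd, remote, sizeof(*remote)) != (ssize_t)sizeof(*remote))
        return -1;
    return 0;
}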

The DOCA RDMA library enables RDMA communications over InfiniBand or RoCE following the three steps listed above, all executed on the CPU. With the introduction of the new GPUNetIO RDMA functions, the application can manage the data path on the GPU, executing Step 3 with a CUDA kernel instead of the CPU, while Steps 1 and 2 remain unchanged, as they are not relevant to the GPU data path.
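To show how this split could be structured, here is a hypothetical outline. The app_rdma_ctx type and every helper below (setup_local_rdma_resources, oob_exchange_with_peer, get_gpu_rdma_handle, rdma_datapath_kernel) are placeholders, not DOCA API calls; they only mark where the CPU control path ends and the GPU data path begins.

/* Hypothetical outline (not DOCA sample code): Steps 1-2 stay on the CPU,
 * Step 3 runs in a CUDA kernel. doca_gpu_dev_rdma is the GPU-side queue
 * handle consumed by the data-path kernel. */
#include <cuda_runtime.h>

struct doca_gpu_dev_rdma;   /* opaque handle from the DOCA GPUNetIO headers */
struct app_rdma_ctx;        /* hypothetical application context */

struct app_rdma_ctx *setup_local_rdma_resources(void);                     /* Step 1 (CPU) */
void oob_exchange_with_peer(struct app_rdma_ctx *ctx);                     /* Step 2 (CPU) */
struct doca_gpu_dev_rdma *get_gpu_rdma_handle(struct app_rdma_ctx *ctx);

__global__ void rdma_datapath_kernel(struct doca_gpu_dev_rdma *rdma_gpu);  /* Step 3 (GPU) */

int run_gpu_rdma_app(int num_queues, int threads_per_block, cudaStream_t stream)
{
    /* Step 1 - local configuration: create RDMA queue(s) and register memory buffers */
    struct app_rdma_ctx *ctx = setup_local_rdma_resources();

    /* Step 2 - exchange info: share queue/buffer descriptors with the peer over an OOB channel */
    oob_exchange_with_peer(ctx);

    /* Step 3 - data path: a CUDA kernel posts RDMA operations directly to the NIC */
    rdma_datapath_kernel<<<num_queues, threads_per_block, 0, stream>>>(get_gpu_rdma_handle(ctx));
    return cudaStreamSynchronize(stream) == cudaSuccess ? 0 : -1;
}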

The benefits of moving the RDMA data path onto the GPU are the same as in the Ethernet use case. In network applications where the data processing happens on the GPU, offloading the network communications from the CPU to the GPU enables the GPU to be the main controller of the application, removing the extra latency needed to interact with the CPU to learn when data is ready and where it is located. This also frees CPU cycles. Additionally, the GPU can manage multiple RDMA queues in parallel at the same time; for example, each CUDA block can post RDMA operations on a different RDMA queue.

IB Verbs and DOCA GPUNetIO perftests

In DOCA 2.7, a new DOCA GPUNetIO RDMA client-server code sample has been introduced to show the usage of the new API and evaluate its correctness. This post compares the performance of GPUNetIO RDMA functions with IB Verbs RDMA functions by reproducing one of the microbenchmarks in the well-known perftest suite.

In a nutshell, perftest is a collection of microbenchmarks to measure RDMA bandwidth (BW) and latency between two peers (server and client) using basic RDMA operations. Although the network control part happens on the CPU, it's possible to specify whether data resides in GPU memory by enabling GPUDirect RDMA with the --use_cuda flag.

As a generic overview, the RDMA write unidirectional BW benchmark (namely ib_write_bw) posts, on every RDMA queue, a list of write requests for messages of the same size for a fixed number of iterations and commands the NIC to execute the posted writes. This is the so-called “ring the doorbell” procedure. To ensure all the writes have been issued, before moving to the next iteration, it polls the completion queue waiting for the confirmation that every write has been executed correctly. For every message size, it is then possible to retrieve the total time spent to post and poll and calculate the BW in MB/s.
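For reference, the post-and-poll pattern described above can be sketched with standard IB Verbs calls (ibv_post_send and ibv_poll_cq). This is not the actual perftest source: queue-pair and memory-region setup are omitted, every write reuses the same local buffer and remote address for simplicity, and only the last write is signaled so a single completion covers the whole batch.

/* Simplified IB Verbs post-and-poll loop (not the actual perftest code).
 * qp, cq, local_addr, lkey, remote_addr, and rkey are assumed to come from
 * the usual connection setup; num_writes is assumed to be <= MAX_POST. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

#define MAX_POST 512  /* matches the 512 writes per iteration used later in this post */

static void post_and_poll(struct ibv_qp *qp, struct ibv_cq *cq,
                          void *local_addr, uint32_t lkey,
                          uint64_t remote_addr, uint32_t rkey,
                          uint32_t msg_size, int num_writes)
{
    struct ibv_sge sge[MAX_POST];
    struct ibv_send_wr wr[MAX_POST];
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;

    /* Build a chained list of RDMA Write work requests; only the last one
     * is signaled so a single completion covers the batch */
    for (int i = 0; i < num_writes; i++) {
        sge[i].addr   = (uintptr_t)local_addr;
        sge[i].length = msg_size;
        sge[i].lkey   = lkey;

        memset(&wr[i], 0, sizeof(wr[i]));
        wr[i].sg_list             = &sge[i];
        wr[i].num_sge             = 1;
        wr[i].opcode              = IBV_WR_RDMA_WRITE;
        wr[i].wr.rdma.remote_addr = remote_addr;
        wr[i].wr.rdma.rkey        = rkey;
        wr[i].next                = (i == num_writes - 1) ? NULL : &wr[i + 1];
        wr[i].send_flags          = (i == num_writes - 1) ? IBV_SEND_SIGNALED : 0;
    }

    /* One ibv_post_send call hands the whole chain to the NIC ("ringing the doorbell") */
    ibv_post_send(qp, &wr[0], &bad_wr);

    /* Poll the completion queue until the signaled (last) write has completed */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
}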

Figure 1 shows the IB Verbs ib_write_bw perftest main loop. At each iteration, the CPU posts a list of RDMA write requests, commands the NIC to execute them (ringing the doorbell) and waits for the completion before moving to the next iteration. With the CUDA flag enabled, packets to write are fetched locally from GPU memory instead of CPU memory.

Figure 1. The IB Verbs ib_write_bw perftest main loop: at each iteration, the CPU posts the RDMA write requests, rings the doorbell, and polls the completion queue

The experiment was to reproduce the ib_write_bw microbenchmark with the DOCA library, using DOCA RDMA for the control path on the CPU to establish the client-server connection and DOCA GPUNetIO RDMA for the data path, posting the writes within a CUDA kernel. The comparison is not apples-to-apples, as perftest uses GPUDirect RDMA to transfer data but network communications are controlled by the CPU, whereas DOCA GPUNetIO uses both GPUDirect RDMA and GPUDirect Async to control network communications and data transfers from the GPU. The goal is to prove DOCA GPUNetIO RDMA performance is comparable with IB Verbs perftest, which is considered the baseline.

To reproduce the ib_write_bw data path and measure the time taken to post the RDMA write operations for each message size, the CPU records a CUDA event, launches the rdma_write_bw CUDA kernel, and then records a second CUDA event. This should give a good approximation of the time in milliseconds taken by the CUDA kernel to post the RDMA writes using DOCA GPUNetIO functions, as shown in Code Snippet 1 below.

int msg_sizes[MAX_MSG] = {....};

for (int msg_idx = 0; msg_idx < MAX_MSG; msg_idx++) {
    do_warmup();

    /* Time only the data path: record an event, launch the CUDA kernel
     * posting the RDMA writes, then record a second event */
    cuEventRecord(start_event, stream);
    rdma_write_bw<<<num_queue, msg_per_size, 0, stream>>>(msg_sizes[msg_idx], …);
    cuEventRecord(end_event, stream);

    cuEventSynchronize(end_event);
    cuEventElapsedTime(&total_ms, start_event, end_event);
    calculate_result(total_ms, msg_sizes[msg_idx], …);
}
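For context, a calculate_result helper along these lines could turn the elapsed time into the reported bandwidth. This is a hypothetical sketch, not the code used in the test; it only shows the arithmetic, with the iteration and per-iteration write counts passed in as parameters.

/* Hypothetical sketch of a calculate_result helper: not the benchmark code,
 * only the arithmetic behind the reported MB/s figure. */
#include <stdio.h>
#include <stddef.h>

static void calculate_result(float total_ms, size_t msg_size,
                             int num_iter, int writes_per_iter)
{
    /* Total bytes moved: e.g. 2,048 iterations x 512 writes x 4,096 B is roughly 4.3 GB */
    double total_bytes = (double)num_iter * writes_per_iter * msg_size;
    double seconds     = total_ms / 1000.0;

    /* Here 1 MB = 2^20 bytes; the exact convention depends on the benchmark */
    double bw_mbs = (total_bytes / (1024.0 * 1024.0)) / seconds;

    printf("msg size %zu B -> %.2f MB/s\n", msg_size, bw_mbs);
}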

In Code Snippet 2 below, the CUDA kernel rdma_write_bw, for a given number of iterations, posts in parallel a sequence of RDMA writes using DOCA GPUNetIO device functions following the weak pattern. Each CUDA thread in the CUDA block posts a write operation.

__global__ void rdma_write_bw(struct doca_gpu_dev_rdma *rdma_gpu,
                              const int num_iter,
                              const size_t msg_size,
                              const struct doca_gpu_buf_arr *server_local_buf_arr,
                              const struct doca_gpu_buf_arr *server_remote_buf_arr)
{
    struct doca_gpu_buf *remote_buf;
    struct doca_gpu_buf *local_buf;
    uint32_t curr_position;
    uint32_t mask_max_position;

    doca_gpu_dev_buf_get_buf(server_local_buf_arr, threadIdx.x, &local_buf);
    doca_gpu_dev_buf_get_buf(server_remote_buf_arr, threadIdx.x, &remote_buf);

    for (int iter_idx = 0; iter_idx < num_iter; iter_idx++) {
        doca_gpu_dev_rdma_get_info(rdma_gpu, &curr_position, &mask_max_position);

        doca_gpu_dev_rdma_write_weak(rdma_gpu,
                                     remote_buf, 0,
                                     local_buf, 0,
                                     msg_size, 0,
                                     DOCA_GPU_RDMA_WRITE_FLAG_NONE,
                                     (curr_position + threadIdx.x) & mask_max_position);

        /* Wait for all CUDA threads to post their RDMA Write */
        __syncthreads();

        if (threadIdx.x == 0) {
            /* Only 1 CUDA thread can commit the writes in the queue to execute them */
            doca_gpu_dev_rdma_commit_weak(rdma_gpu, blockDim.x);
            /* Only 1 CUDA thread can flush the RDMA queue, waiting for the actual execution of the writes */
            doca_gpu_dev_rdma_flush(rdma_gpu);
        }

        __syncthreads();
    }

    return;
}

Figure 2 depicts Code Snippet 2. At every iteration, the GPU CUDA kernel posts in parallel a list of RDMA Write requests (one per CUDA thread in the CUDA block). After synchronizing all CUDA threads, only thread 0 commands the NIC to execute (commit) the writes and waits for the completion (flush the queue) before moving to the next iteration.

Figure 2. The DOCA GPUNetIO rdma_write_bw CUDA kernel: each CUDA thread posts an RDMA Write, then thread 0 commits the writes and flushes the queue

To compare the performance, the same parameters have been set for IB Verbs perftest and DOCA GPUNetIO perftest: 1 RDMA queue, 2,048 iterations, 512 RDMA writes per iteration, and testing message sizes from 64 to 4,096 bytes.

Benchmarks have been executed on a Dell R750 machine with an NVIDIA H100 GPU and a ConnectX-7 network card. As shown by the following figures, the overall time taken by the two implementations of perftest is comparable (Figure 3), as is the peak BW reported in MB/s (Figure 4). Because the two codes follow different logic, time (and thus BW) is measured with different methodologies: the IB Verbs perftest uses the system clock, whereas the DOCA GPUNetIO perftest relies on CUDA events, which may have a different internal time-measurement overhead.

Figure 3. Total elapsed time of the IB Verbs perftest and the DOCA GPUNetIO perftest with one RDMA queue
Figure 4. Peak bandwidth (MB/s) of the IB Verbs perftest and the DOCA GPUNetIO perftest with one RDMA queue

Increasing to four queues for more parallelism, both IB Verbs perftest and DOCA GPUNetIO perftest reach up to 16 GB/s in peak bandwidth, and the two methods again show comparable elapsed times and peak bandwidth, as shown in the following figures.

Figure 5. Total elapsed time of the two perftest implementations with four RDMA queues
Figure 6. Peak bandwidth of the two perftest implementations with four RDMA queues

Note that an application like perftest is not the best tool to show the benefit of GPU utilization, as the amount of achievable parallelization is quite low. In the DOCA GPUNetIO perftest, the RDMA writes are posted in parallel in the queue (512 writes, each by a different CUDA thread), but the time taken by the posting is trivial (~4 microseconds). The majority of the perftest time is spent by the NIC actually executing the RDMA writes, sending data over the network, and returning positive feedback.

This experiment can be considered a success, as it proves that using the DOCA GPUNetIO RDMA API doesn't introduce any relevant overhead compared to regular IB Verbs, and that performance targets are met when running the same type of workload and workflow. ISV developers and end users can adopt DOCA GPUNetIO RDMA and gain the benefits of GPUDirect Async technology, which offloads control of the communication to the GPU.

This architectural choice offers the following benefits:

  • A more scalable application, capable of managing multiple RDMA queues in parallel at the same time (typically one queue per CUDA block), as in the sketch following this list.
  • The ability to take advantage of the high degree of parallelism offered by the GPU, with several CUDA threads working in parallel on different data and posting RDMA operations on the same queue at the lowest possible latency.
  • Lower CPU utilization, making the solution platform-independent (different CPU architectures don't result in significant performance variations).
  • Fewer internal bus transactions (PCIe, for example), as there is no need to synchronize the work on the GPU with the CPU activity. The CPU is no longer responsible for sending or receiving data the GPU must process.
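As an illustration of the first point, the single-queue kernel in Code Snippet 2 could be extended so that each CUDA block drives its own queue. The following is a minimal sketch under the assumption that the host passes GPU-visible arrays of per-queue handles and per-queue buffer arrays; the rdma_gpu_qs, local_buf_arrs, and remote_buf_arrs parameter names are hypothetical, while the device functions are the same ones used in Code Snippet 2.

/* Minimal multi-queue sketch (not from the DOCA samples): each CUDA block owns
 * one RDMA queue and each of its threads posts one write per iteration, using
 * the same weak-mode device functions as Code Snippet 2. Requires the DOCA
 * GPUNetIO device headers; the parameter arrays are assumed to be GPU-visible. */
__global__ void rdma_write_multi_queue(struct doca_gpu_dev_rdma **rdma_gpu_qs,
                                       const int num_iter,
                                       const size_t msg_size,
                                       const struct doca_gpu_buf_arr **local_buf_arrs,
                                       const struct doca_gpu_buf_arr **remote_buf_arrs)
{
    /* One RDMA queue per CUDA block */
    struct doca_gpu_dev_rdma *rdma_gpu = rdma_gpu_qs[blockIdx.x];
    struct doca_gpu_buf *local_buf;
    struct doca_gpu_buf *remote_buf;
    uint32_t curr_position;
    uint32_t mask_max_position;

    /* Per-thread buffers taken from this block's buffer arrays */
    doca_gpu_dev_buf_get_buf(local_buf_arrs[blockIdx.x], threadIdx.x, &local_buf);
    doca_gpu_dev_buf_get_buf(remote_buf_arrs[blockIdx.x], threadIdx.x, &remote_buf);

    for (int iter_idx = 0; iter_idx < num_iter; iter_idx++) {
        doca_gpu_dev_rdma_get_info(rdma_gpu, &curr_position, &mask_max_position);
        doca_gpu_dev_rdma_write_weak(rdma_gpu, remote_buf, 0, local_buf, 0, msg_size, 0,
                                     DOCA_GPU_RDMA_WRITE_FLAG_NONE,
                                     (curr_position + threadIdx.x) & mask_max_position);
        __syncthreads();
        if (threadIdx.x == 0) {
            /* One thread per block commits and flushes its own queue */
            doca_gpu_dev_rdma_commit_weak(rdma_gpu, blockDim.x);
            doca_gpu_dev_rdma_flush(rdma_gpu);
        }
        __syncthreads();
    }
}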

Ready to dive deeper into DOCA GPUNetIO? Check out the official DOCA GPUNetIO Programmer’s Guide for comprehensive insights and technical details.

To learn more about the power of DOCA GPUNetIO, see Inline GPU Packet Processing with NVIDIA DOCA GPUNetIO. For information about how DOCA GPUNetIO revolutionizes GPU communication, see Realizing the Power of Real-Time Network Processing with NVIDIA DOCA GPUNetIO.
