Author: Greg Gutmann
Affiliation: Tokyo Institute of Technology, Nvidia University Ambassador, Nvidia DLI
Introduction
The following post goes over a simple demonstration of CUDA graphs, using the vector add code from Visual Studio’s default CUDA project as a starting point. The aim of this post is to showcase CUDA graphs in nearly their simplest form; as a result, many of their capabilities will not be covered. The focus is on the code involved in creating graphs manually, rather than on the big picture. For an introduction to CUDA graphs, see the following posts.
Nvidia’s post on graphs: https://developer.nvidia.com/blog/cuda-graphs/
CUDA 10 post with graph explanation: https://developer.nvidia.com/blog/cuda-10-features-revealed/
CUDA graphs were created to help offset the overhead involved with submitting work to the GPU. This has become more relevant as GPU performance has increased over the years, shifting the ratio of time spent on GPU operations versus the overhead associated with submitting them. This overhead is typically measured on the microsecond scale; therefore, if your GPU operations take significantly longer than that, graphs may be less valuable for your work.
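As a rough way to estimate this overhead on your own system, you can time a loop of empty kernel launches. The following is a minimal sketch for illustration only; emptyKernel, the launch count, and measureLaunchOverhead are placeholders and not part of the original code.

#include <chrono>
#include <cstdio>

__global__ void emptyKernel() {}

void measureLaunchOverhead()
{
    const int launches = 10000;

    // Warm up so one-time context/setup cost is not included in the timing
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>(); // The kernel does no work, so this mostly measures submission cost
    }
    cudaDeviceSynchronize();
    auto t2 = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    printf("Average time per launch: %f us\n", (double)us / launches);
}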
New Data Types Used
cudaGraph_t           // CUDA graph (opaque)
cudaGraphNode_t       // CUDA graph node (opaque)
cudaKernelNodeParams  // CUDA GPU kernel node parameters
Graph Functions Used Below
cudaGraphCreate        // Creates a graph
cudaGraphAddMemcpyNode // Creates a memcpy node and adds it to a graph
cudaGraphAddKernelNode // Creates a kernel execution node and adds it to a graph
cudaGraphGetNodes      // Returns a graph's nodes
cudaGraphInstantiate   // Creates an executable graph from a graph
cudaGraphLaunch        // Launches an executable graph in a stream
cudaGraphExecDestroy   // Destroys an executable graph
cudaGraphDestroy       // Destroys a graph
Graph API: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html
Manually Creating Graphs
The following takes a look at the process of manually creating a graph, in contrast to creating a graph via stream capture. For Nvidia’s sample covering both approaches, see the simpleCudaGraphs sample included in the CUDA Toolkit.
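For contrast, here is a minimal sketch of the stream capture approach, which records work submitted to a stream into a graph instead of building the nodes by hand. It mirrors the same vector-add operations used later in this post (dev_a, dev_b, dev_c, and addKernel refer to the same buffers and kernel as the code below); this is a sketch, not the code this post benchmarks.

cudaGraph_t graph;
cudaStream_t stream;
cudaStreamCreate(&stream);

// Begin recording; work submitted to this stream is captured into the graph, not executed
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

cudaMemcpyAsync(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice, stream);
addKernel<<<blocks, threads, 0, stream>>>(dev_c, dev_a, dev_b, size);
cudaMemcpyAsync(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost, stream);

// End recording and get back the captured graph; it can then be
// instantiated and launched the same way as a manually created graph
cudaStreamEndCapture(stream, &graph);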
Creating a CUDA Graph
Creating a CUDA graph is simple: just call cudaGraphCreate with a cudaGraph_t object.
// pGraph - Returns newly created graph
// flags  - Graph creation flags, must be 0
cudaGraphCreate ( cudaGraph_t* pGraph, unsigned int flags )
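A minimal usage looks like this:

cudaGraph_t graph;
checkCudaErrors(cudaGraphCreate(&graph, 0)); // flags must be 0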
Adding a Memory Copy Operation to a Graph
In this example, we are using cudaGraphAddMemcpyNode; however, it is also possible to use cudaGraphAddMemcpyNode1D for something closer to cudaMemcpy.
cudaGraphAddMemcpyNode ( cudaGraphNode_t* pGraphNode, cudaGraph_t graph,
                         const cudaGraphNode_t* pDependencies, size_t numDependencies,
                         const cudaMemcpy3DParms* pCopyParams )

cudaGraphAddMemcpyNode1D ( cudaGraphNode_t* pGraphNode, cudaGraph_t graph,
                           const cudaGraphNode_t* pDependencies, size_t numDependencies,
                           void* dst, const void* src, size_t count, cudaMemcpyKind kind )
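As a sketch of the 1D variant, the same host-to-device copy built below with cudaMemcpy3DParms could instead be added with the familiar dst/src/count/kind arguments directly (this snippet is for illustration and is not part of the benchmarked code):

checkCudaErrors(cudaGraphAddMemcpyNode1D(&memcpyNode, graph, NULL, 0,
                                         dev_a, a, size * sizeof(int),
                                         cudaMemcpyHostToDevice));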
The following code fills in the cudaMemcpy3DParms struct similar to how it is done in Nvidia’s sample code; cudaGraphAddMemcpyNode is then called with memcpyParams as one of its inputs. In this example, no dependencies are passed in.
cudaMemcpy3DParms memcpyParams = { 0 };

memcpyParams.srcArray = NULL;
memcpyParams.srcPos   = make_cudaPos(0, 0, 0);
memcpyParams.srcPtr   = make_cudaPitchedPtr((void*)a, size * sizeof(int), size, 1);
memcpyParams.dstArray = NULL;
memcpyParams.dstPos   = make_cudaPos(0, 0, 0);
memcpyParams.dstPtr   = make_cudaPitchedPtr(dev_a, size * sizeof(int), size, 1);
memcpyParams.extent   = make_cudaExtent(size * sizeof(int), 1, 1);
memcpyParams.kind     = cudaMemcpyHostToDevice;

checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams));
Adding a Kernel Operation to a Graph
To manually add a kernel to a graph, a cudaKernelNodeParams struct must be filled in with a pointer to the kernel function, pointers to its arguments, the launch configuration, and the amount of dynamic shared memory that will be used.
Then, when calling cudaGraphAddKernelNode, the cudaKernelNodeParams struct and the dependencies are passed in.
cudaKernelNodeParams kernelNodeParams = { 0 };

kernelNodeParams.func           = (void*)addKernel;
kernelNodeParams.gridDim        = dim3(blocks, 1, 1);
kernelNodeParams.blockDim       = dim3(threads, 1, 1);
kernelNodeParams.sharedMemBytes = 0;
void* kernelArgs[4] = { (void*)&dev_c, (void*)&dev_a, (void*)&dev_b, &size };
kernelNodeParams.kernelParams   = kernelArgs;
kernelNodeParams.extra          = NULL;

checkCudaErrors(cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(),
                                       nodeDependencies.size(), &kernelNodeParams));
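The kernel itself is not shown on this page. A version consistent with the launch configuration and the kernelArgs array above would look like the following sketch; this is an assumption based on the Visual Studio starter kernel with a size parameter added, so check the GitHub link below for the actual code.

__global__ void addKernel(int* c, const int* a, const int* b, unsigned int size)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) { // Guard needed since blocks * threads may exceed size
        c[i] = a[i] + b[i];
    }
}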
Dependencies
Adding graph nodes as dependencies is one of the most critical aspects of creating graphs manually. The declared dependencies determine which operations can potentially overlap and which must be serialized.
std::vector<cudaGraphNode_t> nodeDependencies; // Dependency vector

nodeDependencies.push_back(kernelNode); // Adding a cudaGraphNode_t as a dependency

nodeDependencies.clear(); // Clearing the vector after the node has been added
                          // to the graph with the previous dependencies
The lines of code above illustrate the use of a vector to hold one or more cudaGraphNode_t handles as a list of dependencies, which can then be passed in when creating a subsequent graph node. Afterwards, the dependency vector can be cleared, assuming no later node will have the same list of dependencies.
For example, in the right-hand graph in the figure below, when adding kernel D to the graph, a cudaGraphNode_t vector holding kernel B and kernel C is passed in as the dependencies. After kernel D has been added, the dependency vector can be cleared, since the final memcpy DtoH does not directly depend on kernel B or kernel C. Instead, the final memcpy DtoH takes a dependency vector containing kernel A and kernel D as one of its node creation parameters.
[Figure: example CUDA graph dependency structures]
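In code, the dependency bookkeeping for that right-hand graph might look like the following sketch. The kernelA through kernelD node variables and their params structs are hypothetical placeholders for illustration, not code from this post's project.

std::vector<cudaGraphNode_t> deps;

// Kernel D depends on kernels B and C
deps.push_back(kernelB);
deps.push_back(kernelC);
checkCudaErrors(cudaGraphAddKernelNode(&kernelD, graph, deps.data(), deps.size(), &kernelDParams));
deps.clear(); // The next node has a different dependency list

// The final memcpy DtoH depends on kernels A and D
deps.push_back(kernelA);
deps.push_back(kernelD);
checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, deps.data(), deps.size(), &memcpyParams));
deps.clear();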
Performance Results of the Two Functions
addWithCuda(...)      // Nearly the same as the code from Visual Studio's startup project
addWithCudaGraph(...) // The above function converted to use a CUDA graph
[Figure: performance comparison of addWithCuda and addWithCudaGraph]
Code
The code for the two main functions compared in the performance results above is shown below as a reference. However, if you wish to run the code, I would recommend downloading it from the following GitHub link, as not all of the code is shown on this page.
Github link: https://github.com/Eths33/CUDA_Graph_Vector_Add
Function Without CUDA Graph
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size, int loopCount, __int64* tElapsed)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    int threads = 256;
    int blocks = (size + threads - 1) / threads;

    // Pre-declare timers for reducing warnings related to the goto statements
    std::chrono::steady_clock::time_point t1;
    std::chrono::steady_clock::time_point t2;
    __int64 us_elapsed = 0;

    // Choose which GPU to run on, change this on a multi-GPU system. Then allocate GPU memory.
    {
        cudaStatus = cudaSetDevice(0);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        }

        cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }
    }

    t1 = Clock::now();

    for (int i = 0; i < loopCount; ++i) {

        // Copy input vectors from host memory to GPU buffers.
        cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }

        cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }

        // Launch a kernel on the GPU with one thread for each element.
        addKernel<<<blocks, threads>>>(dev_c, dev_a, dev_b, size);

        // Check for any errors launching the kernel
        cudaStatus = cudaGetLastError();
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        // cudaDeviceSynchronize waits for the kernel to finish, and returns
        // any errors encountered during the launch.
        // NOTE: Below in the graph implementation this sync is included via graph dependencies
        cudaStatus = cudaDeviceSynchronize();
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
            goto Error;
        }

        // Copy output vector from GPU buffer to host memory.
        cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }
    }

    t2 = Clock::now();
    us_elapsed = (__int64)(t2 - t1).count() / 1000;
    printf("Looped %d time(s) in %lld microseconds\n", loopCount, us_elapsed);
    *tElapsed = us_elapsed;

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}
Function Using Manually Created CUDA Graph
cudaError_t addWithCudaGraph(int* c, const int* a, const int* b, unsigned int size, int loopCount, __int64* tElapsedGraph)
{
    // Original
    int* dev_a = 0;
    int* dev_b = 0;
    int* dev_c = 0;
    cudaError_t cudaStatus;

    int threads = 256;
    int blocks = (size + threads - 1) / threads;

    // For Graph
    cudaStream_t streamForGraph;
    cudaGraph_t graph;
    std::vector<cudaGraphNode_t> nodeDependencies;
    cudaGraphNode_t memcpyNode, kernelNode;
    cudaKernelNodeParams kernelNodeParams = { 0 };
    cudaMemcpy3DParms memcpyParams = { 0 };

    // Choose which GPU to run on, change this on a multi-GPU system. Then allocate GPU memory.
    {
        cudaStatus = cudaSetDevice(0);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        }

        cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }
    }

    // Start of Graph Creation
    checkCudaErrors(cudaGraphCreate(&graph, 0));
    checkCudaErrors(cudaStreamCreateWithFlags(&streamForGraph, cudaStreamNonBlocking));

    // Add memcpy nodes for copying input vectors from host memory to GPU buffers
    memset(&memcpyParams, 0, sizeof(memcpyParams));
    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos   = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr   = make_cudaPitchedPtr((void*)a, size * sizeof(int), size, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos   = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr   = make_cudaPitchedPtr(dev_a, size * sizeof(int), size, 1);
    memcpyParams.extent   = make_cudaExtent(size * sizeof(int), 1, 1);
    memcpyParams.kind     = cudaMemcpyHostToDevice;
    checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams));
    nodeDependencies.push_back(memcpyNode);

    memset(&memcpyParams, 0, sizeof(memcpyParams));
    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos   = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr   = make_cudaPitchedPtr((void*)b, size * sizeof(int), size, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos   = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr   = make_cudaPitchedPtr(dev_b, size * sizeof(int), size, 1);
    memcpyParams.extent   = make_cudaExtent(size * sizeof(int), 1, 1);
    memcpyParams.kind     = cudaMemcpyHostToDevice;
    checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams));
    nodeDependencies.push_back(memcpyNode);

    // Add a kernel node for launching a kernel on the GPU
    memset(&kernelNodeParams, 0, sizeof(kernelNodeParams));
    kernelNodeParams.func           = (void*)addKernel;
    kernelNodeParams.gridDim        = dim3(blocks, 1, 1);
    kernelNodeParams.blockDim       = dim3(threads, 1, 1);
    kernelNodeParams.sharedMemBytes = 0;
    void* kernelArgs[4] = { (void*)&dev_c, (void*)&dev_a, (void*)&dev_b, &size };
    kernelNodeParams.kernelParams   = kernelArgs;
    kernelNodeParams.extra          = NULL;
    checkCudaErrors(cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(),
                                           nodeDependencies.size(), &kernelNodeParams));
    nodeDependencies.clear();
    nodeDependencies.push_back(kernelNode);

    // Add memcpy node for copying output vector from GPU buffers to host memory
    memset(&memcpyParams, 0, sizeof(memcpyParams));
    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos   = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr   = make_cudaPitchedPtr(dev_c, size * sizeof(int), size, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos   = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr   = make_cudaPitchedPtr(c, size * sizeof(int), size, 1);
    memcpyParams.extent   = make_cudaExtent(size * sizeof(int), 1, 1);
    memcpyParams.kind     = cudaMemcpyDeviceToHost;
    checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, nodeDependencies.data(),
                                           nodeDependencies.size(), &memcpyParams));

    if (VERBOSE) {
        cudaGraphNode_t* nodes = NULL;
        size_t numNodes = 0;
        checkCudaErrors(cudaGraphGetNodes(graph, nodes, &numNodes));
        printf("Num of nodes in the graph created manually = %zu\n", numNodes);
    }

    // Create an executable graph from a graph
    cudaGraphExec_t graphExec;
    checkCudaErrors(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

    // Run the graph
    auto t1 = Clock::now();

    for (int i = 0; i < loopCount; ++i) {
        checkCudaErrors(cudaGraphLaunch(graphExec, streamForGraph));
        checkCudaErrors(cudaStreamSynchronize(streamForGraph));
    }

    auto t2 = Clock::now();
    __int64 us_elapsed = (__int64)(t2 - t1).count() / 1000;
    printf("Looped %d time(s) in %lld microseconds\n", loopCount, us_elapsed);
    *tElapsedGraph = us_elapsed;

    // Clean up
    checkCudaErrors(cudaGraphExecDestroy(graphExec));
    checkCudaErrors(cudaGraphDestroy(graph));
    checkCudaErrors(cudaStreamDestroy(streamForGraph));
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}
Contact me if you would like to use the contents of this post. Thanks.
Copyright © 2020 by Gregory Gutmann