CUDA Graph Usage: CUDA Feature Testing

Author: Greg Gutmann
Affiliation: Tokyo Institute of Technology, Nvidia University Ambassador, Nvidia DLI

Introduction

This post walks through a simple demonstration of CUDA graphs, using the vector add code from Visual Studio's default CUDA project as a starting point. The aim is to show CUDA graphs in nearly their simplest possible form, so many of their capabilities will not be covered. The focus is on the code involved in creating graphs manually rather than on the big picture. For an introduction to CUDA graphs, see the following posts.

Nvidia’s post on graphs: https://developer.nvidia.com/blog/cuda-graphs/
CUDA 10 post with graph explanation: https://developer.nvidia.com/blog/cuda-10-features-revealed/

CUDA graphs were created to help reduce the overhead of submitting work to the GPU. This has become more relevant as GPU performance has increased over the years, shifting the ratio of time spent on GPU operations versus the overhead of launching them. That overhead is typically measured on the microsecond scale; therefore, if your GPU operations take significantly longer than that, graphs may be less valuable for your work.

New Data Types Used

cudaGraph_t // CUDA graph (opaque)
cudaGraphNode_t // CUDA graph node (opaque)
cudaKernelNodeParams // CUDA GPU kernel node parameters

Graph Functions Used Below

cudaGraphCreate // Creates a graph
cudaGraphAddMemcpyNode // Creates a memcpy node and adds it to a graph
cudaGraphAddKernelNode // Creates a kernel execution node and adds it to a graph
cudaGraphGetNodes // Returns a graph's nodes
cudaGraphInstantiate // Creates an executable graph from a graph
cudaGraphLaunch // Launches an executable graph in a stream
cudaGraphExecDestroy // Destroys an executable graph
cudaGraphDestroy // Destroys a graph

Graph API: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html
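
Taken together, these calls follow a simple lifecycle. Below is a rough sketch of how they fit together (error checking omitted; the node-creation step is expanded in the sections that follow):

cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStream_t stream;

cudaGraphCreate(&graph, 0);                                // 1. Create an empty graph
// ... describe the work: cudaGraphAddMemcpyNode, cudaGraphAddKernelNode, ...
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);    // 2. Create an executable graph
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cudaGraphLaunch(graphExec, stream);                        // 3. Launch (can be repeated cheaply)
cudaStreamSynchronize(stream);
cudaGraphExecDestroy(graphExec);                           // 4. Clean up
cudaGraphDestroy(graph);
cudaStreamDestroy(stream);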

Manually Creating Graphs

The following sections walk through the process of manually creating a graph, in contrast to creating one via stream capture. For Nvidia's sample covering both approaches, see the simpleCudaGraphs sample included in the CUDA Toolkit.
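
For contrast, a stream-capture version records ordinary asynchronous stream calls into a graph rather than adding nodes explicitly. A minimal sketch (assuming the same buffers, launch configuration, and addKernel used in the code later in this post, on a CUDA 10.1 or newer toolkit):

cudaGraph_t graph;
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

// Calls issued between Begin/EndCapture are recorded into the graph, not executed
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMemcpyAsync(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice, stream);
addKernel<<<blocks, threads, 0, stream>>>(dev_c, dev_a, dev_b, size);
cudaMemcpyAsync(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost, stream);
cudaStreamEndCapture(stream, &graph); // graph now describes the captured work

The rest of this post focuses on the manual approach.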

Creating a CUDA Graph

Creating a CUDA graph is simple: just call cudaGraphCreate with a pointer to a cudaGraph_t object.

// pGraph - Returns newly created graph
// flags  - Graph creation flags, must be 0
cudaGraphCreate ( cudaGraph_t* pGraph, unsigned int flags )
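
For example, as used later in this post:

cudaGraph_t graph;
checkCudaErrors(cudaGraphCreate(&graph, 0)); // flags must be 0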

Adding a Memory Copy Operation to a Graph

In this example, we are using cudaGraphAddMemcpyNode; however, it is also possible to use cudaGraphAddMemcpyNode1D for something closer to cudaMemcpy (see the sketch after the signatures below).

cudaGraphAddMemcpyNode ( cudaGraphNode_t* pGraphNode, cudaGraph_t graph, const cudaGraphNode_t* pDependencies, size_t numDependencies, const cudaMemcpy3DParms* pCopyParams )

cudaGraphAddMemcpyNode1D ( cudaGraphNode_t* pGraphNode, cudaGraph_t graph, const cudaGraphNode_t* pDependencies, size_t numDependencies, void* dst, const void* src, size_t count, cudaMemcpyKind kind )
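
As a sketch, the first host-to-device copy in the code below could instead be added with the 1D variant (available in more recent CUDA toolkits), passing the destination, source, byte count, and copy kind much like a plain cudaMemcpy:

checkCudaErrors(cudaGraphAddMemcpyNode1D(&memcpyNode, graph, NULL, 0,
    dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice));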

The following code fills in the cudaMemcpy3DParms struct, similar to how it is done in Nvidia's sample code; cudaGraphAddMemcpyNode is then called with memcpyParams as one of its inputs. In this example no dependencies are passed in.

cudaMemcpy3DParms memcpyParams = { 0 };

memcpyParams.srcArray = NULL;
memcpyParams.srcPos = make_cudaPos(0, 0, 0);
memcpyParams.srcPtr = make_cudaPitchedPtr((void*)a, size * sizeof(int), size, 1);
memcpyParams.dstArray = NULL;
memcpyParams.dstPos = make_cudaPos(0, 0, 0);
memcpyParams.dstPtr = make_cudaPitchedPtr(dev_a, size * sizeof(int), size, 1);
memcpyParams.extent = make_cudaExtent(size * sizeof(int), 1, 1);
memcpyParams.kind = cudaMemcpyHostToDevice;

checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams));

Adding a Kernel Operation to a Graph

To manually add a kernel to a graph, a cudaKernelNodeParams struct must be filled in with a pointer to the function, pointers to the arguments, the launch configuration, and the amount of dynamic shared memory that will be used.

Then, when calling cudaGraphAddKernelNode, the cudaKernelNodeParams struct and the node's dependencies are passed in.

cudaKernelNodeParams kernelNodeParams = { 0 };

kernelNodeParams.func = (void*)addKernel;
kernelNodeParams.gridDim = dim3(blocks, 1, 1);
kernelNodeParams.blockDim = dim3(threads, 1, 1);
kernelNodeParams.sharedMemBytes = 0;
void* kernelArgs[4] = { (void*)&dev_c, (void*)&dev_a, (void*)&dev_b, &size };
kernelNodeParams.kernelParams = kernelArgs;
kernelNodeParams.extra = NULL;

checkCudaErrors(cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(), nodeDependencies.size(), &kernelNodeParams));

Dependencies

Declaring dependencies between graph nodes is one of the more critical aspects of creating graphs manually. The declared dependencies determine which operations can potentially overlap and which must be serialized.

std::vector<cudaGraphNode_t> nodeDependencies; // Dependency vector
nodeDependencies.push_back(kernelNode); // Adding a cudaGraphNode_t as a dependency
nodeDependencies.clear(); // Clear the vector after the node has been added to the graph with the previous dependencies

The lines of code above illustrate the use of a vector to hold one or more cudaGraphNode_t's as a list of dependencies, which can then be passed in when creating a subsequent graph node. Afterwards, the dependency vector can be cleared, assuming no later node will share the same list of dependencies.

For example, in the right-hand graph in the figure below, when adding kernel D to the graph, a cudaGraphNode_t vector holding kernel B and kernel C is passed in as dependencies. After kernel D has been added, the dependency vector can be cleared, as the final memcpy DtoH does not directly depend on kernel B or kernel C. The final memcpy DtoH will instead have a dependency vector containing kernel A and kernel D as one of its node creation parameters.

The graph on the left matches the code below, containing simple serial dependencies. The graph on the right is an example of a graph with more complex dependencies, to illustrate that much more can be done with graphs.
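
As a sketch of how the right-hand graph's dependencies could be wired (kernelA through kernelD, memcpyDtoH, and the corresponding parameter structs are hypothetical names for this example):

std::vector<cudaGraphNode_t> deps;

// Kernel D depends on kernels B and C
deps.push_back(kernelB);
deps.push_back(kernelC);
checkCudaErrors(cudaGraphAddKernelNode(&kernelD, graph, deps.data(), deps.size(), &kernelDParams));
deps.clear();

// The final DtoH memcpy depends on kernels A and D
deps.push_back(kernelA);
deps.push_back(kernelD);
checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyDtoH, graph, deps.data(), deps.size(), &memcpyDtoHParams));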

Performance Results of the Two Functions

addWithCuda(...) // Nearly the same as the code from Visual Studio's startup project
addWithCudaGraph(...) // The above function converted to use a CUDA graph
Output from the code being discussed

Code

The code for the two main functions compared in the performance results above is shown below for reference. However, if you wish to run the code, I recommend downloading it from the GitHub link below, as not all of the code is shown on this page.

GitHub link: https://github.com/Eths33/CUDA_Graph_Vector_Add

Function Without CUDA Graph

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size, int loopCount, __int64* tElapsed)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    int threads = 256;
    int blocks = (size + threads - 1) / threads;

    // Pre-declare timers for reducing warnings related to the goto statements
    std::chrono::steady_clock::time_point t1;
    std::chrono::steady_clock::time_point t2;
    __int64 us_elapsed = 0;

    // Choose which GPU to run on, change this on a multi-GPU system. Then allocate GPU memory.
    {
        cudaStatus = cudaSetDevice(0);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        }

        cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }
    }
    
    t1 = Clock::now();
    for (int i = 0; i < loopCount; ++i) {
        // Copy input vectors from host memory to GPU buffers.
        cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }

        cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }

        // Launch a kernel on the GPU with one thread for each element.
        addKernel<<<blocks, threads>>>(dev_c, dev_a, dev_b, size);

        // Check for any errors launching the kernel
        cudaStatus = cudaGetLastError();
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        // cudaDeviceSynchronize waits for the kernel to finish, and returns
        // any errors encountered during the launch.
        // NOTE: Below in the graph implementation this sync is included via graph dependencies 
        cudaStatus = cudaDeviceSynchronize();
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
            goto Error;
        }

        // Copy output vector from GPU buffer to host memory.
        cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }
    }
    t2 = Clock::now();
    us_elapsed = (__int64)(t2 - t1).count() / 1000;
    printf("Looped %d time(s) in %lld microseconds\n", loopCount, us_elapsed);
    *tElapsed = us_elapsed;

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    
    return cudaStatus;
}

Function Using Manually Created CUDA Graph

cudaError_t addWithCudaGraph(int* c, const int* a, const int* b, unsigned int size, int loopCount, __int64* tElapsedGraph)
{
    // Original
    int* dev_a = 0;
    int* dev_b = 0;
    int* dev_c = 0;
    cudaError_t cudaStatus;

    int threads = 256;
    int blocks = (size + threads - 1) / threads;

    // For Graph
    cudaStream_t streamForGraph;
    cudaGraph_t graph;
    std::vector<cudaGraphNode_t> nodeDependencies;
    cudaGraphNode_t memcpyNode, kernelNode;
    cudaKernelNodeParams kernelNodeParams = { 0 };
    cudaMemcpy3DParms memcpyParams = { 0 };

    // Choose which GPU to run on, change this on a multi-GPU system. Then allocate GPU memory.
    {
        cudaStatus = cudaSetDevice(0);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        }

        cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }

        cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed!");
        }
    }

    // Start of Graph Creation

    checkCudaErrors(cudaGraphCreate(&graph, 0));
    checkCudaErrors(cudaStreamCreateWithFlags(&streamForGraph, cudaStreamNonBlocking));

    // Add memcpy nodes for copying input vectors from host memory to GPU buffers

    memset(&memcpyParams, 0, sizeof(memcpyParams));

    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr = make_cudaPitchedPtr((void*)a, size * sizeof(int), size, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr = make_cudaPitchedPtr(dev_a, size * sizeof(int), size, 1);
    memcpyParams.extent = make_cudaExtent(size * sizeof(int), 1, 1);
    memcpyParams.kind = cudaMemcpyHostToDevice;

    checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams));
    nodeDependencies.push_back(memcpyNode);

    memset(&memcpyParams, 0, sizeof(memcpyParams));

    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr = make_cudaPitchedPtr((void*)b, size * sizeof(int), size, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr = make_cudaPitchedPtr(dev_b, size * sizeof(int), size, 1);
    memcpyParams.extent = make_cudaExtent(size * sizeof(int), 1, 1);
    memcpyParams.kind = cudaMemcpyHostToDevice;
        
    checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams));
    nodeDependencies.push_back(memcpyNode);

    // Add a kernel node for launching a kernel on the GPU

    memset(&kernelNodeParams, 0, sizeof(kernelNodeParams));

    kernelNodeParams.func = (void*)addKernel;
    kernelNodeParams.gridDim = dim3(blocks, 1, 1);
    kernelNodeParams.blockDim = dim3(threads, 1, 1);
    kernelNodeParams.sharedMemBytes = 0;
    void* kernelArgs[4] = { (void*)&dev_c, (void*)&dev_a, (void*)&dev_b, &size };
    kernelNodeParams.kernelParams = kernelArgs;
    kernelNodeParams.extra = NULL;

    checkCudaErrors(cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(), nodeDependencies.size(), &kernelNodeParams));
    nodeDependencies.clear();
    nodeDependencies.push_back(kernelNode);

    // Add memcpy node for copying output vector from GPU buffers to host memory

    memset(&memcpyParams, 0, sizeof(memcpyParams));

    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr = make_cudaPitchedPtr(dev_c, size * sizeof(int), size, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr = make_cudaPitchedPtr(c, size * sizeof(int), size, 1);
    memcpyParams.extent = make_cudaExtent(size * sizeof(int), 1, 1);
    memcpyParams.kind = cudaMemcpyDeviceToHost;
    checkCudaErrors(cudaGraphAddMemcpyNode(&memcpyNode, graph, nodeDependencies.data(), nodeDependencies.size(), &memcpyParams)); 

    if (VERBOSE) {
        cudaGraphNode_t* nodes = NULL;
        size_t numNodes = 0;
        checkCudaErrors(cudaGraphGetNodes(graph, nodes, &numNodes));
        printf("Num of nodes in the graph created manually = %zu\n", numNodes);
    }

    // Create an executable graph from a graph

    cudaGraphExec_t graphExec;
    checkCudaErrors(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

    // Run the graph

    auto t1 = Clock::now();
    for (int i = 0; i < loopCount; ++i) {
        checkCudaErrors(cudaGraphLaunch(graphExec, streamForGraph));
        checkCudaErrors(cudaStreamSynchronize(streamForGraph));
    }
    auto t2 = Clock::now();
    __int64 us_elapsed = (__int64)(t2 - t1).count() / 1000;
    printf("Looped %d time(s) in %lld microseconds\n", loopCount, us_elapsed);
    *tElapsedGraph = us_elapsed;

    // Clean up

    checkCudaErrors(cudaGraphExecDestroy(graphExec));
    checkCudaErrors(cudaGraphDestroy(graph));
    checkCudaErrors(cudaStreamDestroy(streamForGraph));

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}

Contact me if you would like to use the contents of this post. Thanks.
Copyright © 2020 by Gregory Gutmann
