Error Checking & Debugging GPU Code: CUDA Introduction Part 3

Author: Greg Gutmann
Affiliation: Tokyo Institute of Technology, Nvidia University Ambassador, Nvidia DLI 

Introduction

This is a continuation of my posts on CUDA programming; for the previous post on thread indexing and memory, click here [Post 2]. In this post, we will look at methods of error checking and debugging GPU code. However, the CUDA API function calls themselves are not explained in detail; for that, I recommend the CUDA documentation [Error Handling].

For beginners, it is not necessary to fully understand everything mentioned here, especially the later material on asynchronous and concurrent GPU debugging. But it is valuable to have some familiarity with the topic for when it becomes relevant in the future.

Debugging with Print Statements

// 0_creatingErrors.cu
#include <stdio.h>

__global__ void kernelA(int * globalArray){
	int globalThreadId = blockIdx.y * blockDim.x + threadIdx.x;

	// If the problem is small or if printing a subset of the problem 
	// (inside conditional expression, etc...). 
	// Then using printf inside of a kernel can be a viable debugging approach.
	printf("blockIdx.x:%d * blockDim.x:%d + threadIdx.x:%d = globalThreadId:%d\n", blockIdx.x, blockDim.x, threadIdx.x, globalThreadId);
	globalArray[globalThreadId] = globalThreadId;
}
 
int main()
{
	int elementCount = 32;
	int dataSize = elementCount * sizeof(int);
	
	cudaSetDevice(0);
	
	int * managedArray;
	cudaMallocManaged(&managedArray, dataSize);

	kernelA <<<4,8>>>(managedArray);

	cudaDeviceSynchronize();
	
	printf("\n");

	// Printing a portion of results can be another good debugging approach
	for(int i = 0; i < elementCount; i++){
		printf("%d%s", managedArray[i], (i < elementCount - 1) ? ", " : "\n");
	}	
	
	cudaFree(managedArray);

	cudaDeviceReset();
 
	return 0;
}

One of the most basic methods of debugging, or confirming results, is to simply print out the results.

0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

By doing this we can see there is a problem with indices 8-31, and potentially with all indices. But there is not a lot of information here; it would likely be even less enlightening if the kernel were more complex.

The second way to print is from the CUDA kernel itself.

Sample's Output from the GPU Kernel:
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:0 = globalThreadId:0
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:1 = globalThreadId:1
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:2 = globalThreadId:2
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:3 = globalThreadId:3
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:4 = globalThreadId:4
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:5 = globalThreadId:5
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:6 = globalThreadId:6
blockIdx.x:1 * blockDim.x:8 + threadIdx.x:7 = globalThreadId:7
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:0 = globalThreadId:0
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:1 = globalThreadId:1
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:2 = globalThreadId:2
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:3 = globalThreadId:3
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:4 = globalThreadId:4
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:5 = globalThreadId:5
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:6 = globalThreadId:6
blockIdx.x:3 * blockDim.x:8 + threadIdx.x:7 = globalThreadId:7
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:0 = globalThreadId:0
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:1 = globalThreadId:1
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:2 = globalThreadId:2
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:3 = globalThreadId:3
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:4 = globalThreadId:4
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:5 = globalThreadId:5
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:6 = globalThreadId:6
blockIdx.x:2 * blockDim.x:8 + threadIdx.x:7 = globalThreadId:7
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:0 = globalThreadId:0
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:1 = globalThreadId:1
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:2 = globalThreadId:2
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:3 = globalThreadId:3
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:4 = globalThreadId:4
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:5 = globalThreadId:5
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:6 = globalThreadId:6
blockIdx.x:0 * blockDim.x:8 + threadIdx.x:7 = globalThreadId:7

From this, we can see that the correct number of threads and blocks are launched, but something goes wrong when computing globalThreadId: blockIdx.y was used instead of blockIdx.x. If blockIdx.y were printed, it would be more obvious what is happening (all zeros), but often a mistake shows up in one place and not another.

Another detail that is easy to see here is that, when running massively parallel code, print statements may quickly become overwhelming. However, targeted printing can still be useful.

Part 1: Error Checking Helpers

It is often not convenient, or clean looking, to write error checking code after every operation or function call. Thus it is common to write macros or helper functions that speed up writing code and make it look cleaner.

Below is an example of explicitly writing out error checking, taken from Visual Studio's CUDA startup project. On the plus side, with this approach you can provide very exact error reporting and suggest possible fixes.

// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
}

The helper samples in the following subsections have been broken up into .cu and .cuh files for simpler compilation (linking, etc.).

CUDA Error Checking Macros

Macros are a popular option as they can automatically collect and report information about the file, function, and line number. If you browse the samples included with the CUDA SDK, you will come across error checking macros. The sample code below originally came from there but has been modified quite a bit.

// errorCheckingMacro.cuh
#ifndef CHECK_CUDA_ERROR_M_H
#define CHECK_CUDA_ERROR_M_H

#define PRINT_ON_SUCCESS 1

// To be used around calls that return an error code, ex. cudaDeviceSynchronize or cudaMallocManaged
void checkError(cudaError_t code, char const * func, const char *file, const int line, bool abort = true);
#define checkCUDAError(val) { checkError((val), #val, __FILE__, __LINE__); }	// in-line regular function
#define checkCUDAError2(val) checkError((val), #val, __FILE__, __LINE__) // typical macro

// To be used after calls that do not return an error code, ex. kernels to check kernel launch errors
void checkLastError(char const * func, const char *file, const int line, bool abort = true);
#define checkLastCUDAError(func) { checkLastError(func, __FILE__, __LINE__); }
#define checkLastCUDAError_noAbort(func) { checkLastError(func, __FILE__, __LINE__, 0); }

#endif // CHECK_CUDA_ERROR_M_H

In the above sample, the checkCUDAError and checkCUDAError2 definitions show two different ways of writing the same macro. I tend to use curly brackets since the macro then acts like a regular function when invoked. For more details on macros see this post: http://www.ebyte.it/library/codesnippets/WritingCppMacros.html#4

// errorCheckingMacro.cu
#include "errorCheckingMacro.cuh"
#include <stdio.h>

// Assumes single device when calling cudaDeviceReset(); and exit(code);
// In some cases a more lengthy program clean up / termination may be needed

void checkError(cudaError_t code, char const * func, const char *file, const int line, bool abort)
{
	if (code != cudaSuccess) 
	{
		const char * errorMessage = cudaGetErrorString(code);
		fprintf(stderr, "CUDA error returned from \"%s\" at %s:%d, Error code: %d (%s)\n", func, file, line, code, errorMessage);
		if (abort){
			cudaDeviceReset();
			exit(code);
		}
	}
	else if (PRINT_ON_SUCCESS)
	{
		const char * errorMessage = cudaGetErrorString(code);
		fprintf(stderr, "CUDA error returned from \"%s\" at %s:%d, Error code: %d (%s)\n", func, file, line, code, errorMessage);
	}
}

void checkLastError(char const * func, const char *file, const int line, bool abort)
{
	cudaError_t code = cudaGetLastError();
	if (code != cudaSuccess)
	{
		const char * errorMessage = cudaGetErrorString(code);
		fprintf(stderr, "CUDA error returned from \"%s\" at %s:%d, Error code: %d (%s)\n", func, file, line, code, errorMessage);
		if (abort) {
			cudaDeviceReset();
			exit(code);
		}
	}
	else if (PRINT_ON_SUCCESS)
	{
		const char * errorMessage = cudaGetErrorString(code);
		fprintf(stderr, "CUDA error returned from \"%s\" at %s:%d, Error code: %d (%s)\n", func, file, line, code, errorMessage);
	}
}

Above, the first function takes the CUDA error code as one of its parameters and uses it to check whether an error occurred, and if so, what kind. The second function instead calls cudaGetLastError, which is needed for CUDA operations that do not return an error code (e.g. kernel launches).

CUDA Error Checking Functions with Added Functionality

It should be noted that the examples shown below could be converted into macros. The samples are intended to show another approach and some additional methods of error checking asynchronous operations.

Non-macro based error checking lacks the ability to automatically gather the file name, the function name and the line number for printing or other uses. As a solution to this, when calling the error checking function I pass in an identification string.

// errorChecking.cuh
#ifndef CHECK_CUDA_ERROR_H
#define CHECK_CUDA_ERROR_H

// This could be set with a compile time flag ex. DEBUG or _DEBUG
// But then would need to use #if / #ifdef not if / else if in code
#define FORCE_SYNC_GPU 0
#define PRINT_ON_SUCCESS 1

cudaError_t checkAndPrint(const char * name, int sync = 0);
cudaError_t checkCUDAError(const char * name, int sync = 0);

#endif // CHECK_CUDA_ERROR_H

The code below shows error checking that will check errors with or without forced synchronization.

Pros: forcing the code to synchronize ensures the operation just called has finished before checking whether any errors occurred. Without this, errors from asynchronous calls may surface later in the code when checking errors for other operations, leading to confusion, or the error may never be reported at all.

Cons: forced synchronization should only be used when debugging, as constant synchronization calls incur a very large performance hit. Forced synchronization may also change how the code runs, since it prevents most operations from overlapping. For example, the code might normally use many asynchronous calls or run operations concurrently on independent threads. This can lead to errors that show up in a release build but not while debugging.

// errorChecking.cu
#include "errorChecking.cuh"
#include <stdio.h>

cudaError_t checkAndPrint(const char * name, int sync) {
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess)
	{
		const char * errorMessage = cudaGetErrorString(err);
		fprintf(stderr, "CUDA error check \"%s\" returned ERROR code: %d (%s) %s \n", name, err, errorMessage, (sync) ? "after sync" : "");
	}
	else if (PRINT_ON_SUCCESS) {
		printf("CUDA error check \"%s\" executed successfully %s\n", name, (sync) ? "after sync" : "");
	}
	return err;
}

cudaError_t checkCUDAError(const char * name, int sync) {
	cudaError_t err = cudaSuccess;
	if (sync || FORCE_SYNC_GPU) {
		err = checkAndPrint(name, 0);
		cudaDeviceSynchronize();
		err = checkAndPrint(name, 1);
	}
	else {
		err = checkAndPrint(name, 0);
	}
	return err;
}

As seen above, the function checkCUDAError is designed to be called right after GPU operations instead of taking the return value from functions that return an error code. This is because not all CUDA functions return an error value, and some calls are asynchronous, as mentioned.

For the asynchronous case, the error checking function checks for errors both right after the operation is invoked and again after it has completed. Example case: kernels can have pre-launch errors (incorrect configuration) and kernel execution errors (errors while running the GPU code).

Asynchronous possibilities:

  • Memory copies with Async in the name
  • Kernels

Concurrency can also be increased by using CUDA streams. (A future post, or check Google now 🙂 )

Note: the error checking functions could also be written to wrap the CUDA calls, like the macros above; there are many possibilities. General approaches will work, but customizing to your needs may become more desirable as your code grows more complex.

cudaError_t checkCUDAError(cudaError_t err, const char * name, int sync = 0);
cudaError_t checkCUDAError(cudaError_t err);

Part 2: Using the Error Checking Code Above

Next, we will look at some short samples that make use of the error checking topics just covered.

Use of Error Checking Macros

This sample shows the use of the error checking macros. It also includes other concepts, such as printing a subset of results or writing testing functions.

// 2_creatingErrorsMacro.cu
#include "errorCheckingMacro.cuh"
#include <stdio.h>
 
__global__ void kernelA(int * globalArray){
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
	globalArray[globalThreadId] = globalThreadId;
}

// Testing function
void testResults(int * data, int N) {
	// For more information, incorrect values and their indices can be printed when found
	int testPass = true;
	for (int i = 0; i < N; i++) {
		if (data[i] != i) {
			testPass = false;
		}
	}
	printf("Result of test: %s\n\n", (testPass) ? "passed!" : "uh oh...");
}

int main()
{
	int elementCount = 10240;
	int dataSize = elementCount * sizeof(int);
	
	checkCUDAError(cudaSetDevice(0));
	
	int * managedArray;
	checkCUDAError(cudaMallocManaged(&managedArray, dataSize));
 
	kernelA <<<4,1500>>>(managedArray); // Too many threads per block. 
	// Also, not enough threads for the amount of data but that is not the focus for this sample. 
 
	checkLastCUDAError_noAbort("kernelA");

	checkCUDAError(cudaDeviceSynchronize());
	

	// Can print a subset when problem size is larger
	int printStart = elementCount - 16;
	int printEnd = elementCount;
	printf("\nChecking values[%d-%d): ", printStart, printEnd); // Interval notation: https://en.wikipedia.org/wiki/Bracket_(mathematics)#Intervals
	for(int i = printStart; i < printEnd; i++){
		printf("%d%s", managedArray[i], (i < printEnd - 1) ? ", " : "\n");
	}	
	
	// Or better yet, write a testing function and let the computer test for you
	testResults(managedArray, elementCount);
	
	checkCUDAError(cudaFree(managedArray));
	
	checkCUDAError(cudaDeviceReset());
 
	return 0;
}
Sample's Output: 
CUDA error returned from "cudaSetDevice(0)" at 2_creatingErrorsMacro.cu:15, Error code: 0 (no error)
CUDA error returned from "cudaMallocManaged(&managedArray, dataSize)" at 2_creatingErrorsMacro.cu:18, Error code: 0 (no error)
CUDA error returned from "kernelA" at 2_creatingErrorsMacro.cu:22, Error code: 9 (invalid configuration argument)
CUDA error returned from "cudaDeviceSynchronize()" at 2_creatingErrorsMacro.cu:24, Error code: 0 (no error)
Checking values[10224-10240): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Result of test: uh oh…
CUDA error returned from "cudaFree(managedArray)" at 2_creatingErrorsMacro.cu:46, Error code: 0 (no error)
CUDA error returned from "cudaDeviceReset()" at 2_creatingErrorsMacro.cu:48, Error code: 0 (no error)

The error in this sample (error code 9) was caused by trying to launch kernelA with 1500 threads per block, more than the 1024-thread-per-block limit. The result of this error can also be seen in the portion of results printed and in the output of the testing function.

Testing functions are generally the better approach, as errors may occur in the portion of the results that was not printed. Printing is often just a way for developers to reassure themselves that the results are correct by seeing them personally.

One detail that might seem strange is that I added a feature to prevent the program from closing when an error occurs. From my experience creating simulations, an error does not always affect the outcome of the code. Depending on the situation, it may be better to notify the user, or to silently log the error and program state, instead of closing the program immediately and frustrating the user.

Use of Error Checking Functions

The sample below was made to run on 32 elements but then modified to showcase errors when calling cudaSetDevice and cudaMallocManaged.

// 1_creatingErrors.cu
#include "errorChecking.cuh"
#include <stdio.h>

enum errorCheckFlag {
	NO_SYNC,
	SYNC
};

__global__ void kernelA(int * globalArray){
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
	globalArray[globalThreadId] = globalThreadId;
}
 
void helperFunction(int * managedMem){
	kernelA <<<4,8>>>(managedMem);
	checkCUDAError("<helperFunction> kernelA", SYNC); 
	// Showing a possible naming convention "<location>" for tracking down error   
	// locations when not using macros. A bit of work but it will save time.
}

int main()
{
	int elementCount = 32;
	int dataSize = elementCount * sizeof(int);
	
	cudaSetDevice(5); // The PC I am on does not have 6 GPUs
	checkCUDAError("<main> cudaSetDevice");
	
	int * managedArray;
	cudaMallocManaged(&managedArray, dataSize * 1000000000); // My GPU does not have 128 GB
	checkCUDAError("<main> cudaMallocManaged");

	kernelA <<<4,8>>>(managedArray);
	checkCUDAError("<main> kernelA", SYNC);

	//cudaDeviceSynchronize(); // checkCUDAError will sync with flag SYNC
	
	cudaFree(managedArray);
	checkCUDAError("<main> cudaFree");

	cudaDeviceReset();
	checkCUDAError("<main> cudaDeviceReset");
 
	return 0;
}

Sample's Output as is:
CUDA error check "<main> cudaSetDevice" returned ERROR code: 10 (invalid device ordinal)
CUDA error check "<main> cudaMallocManaged" returned ERROR code: 2 (out of memory)
CUDA error check "<main> kernelA" executed successfully
CUDA error check "<main> kernelA" returned ERROR code: 77 (an illegal memory access was encountered) after sync
CUDA error check "<main> cudaFree" returned ERROR code: 77 (an illegal memory access was encountered)
CUDA error check "<main> cudaDeviceReset" executed successfully

The first error (10) was caused because there is no GPU with the device ID of 5 on my system.
The next error (2) was caused because the code attempted to allocate more memory than was available.
Then error (77), in both locations, was a result of trying to work with memory that was never successfully allocated.

Sample's Output if Corrected:
CUDA error check "<main> cudaSetDevice" executed successfully
CUDA error check "<main> cudaMallocManaged" executed successfully
CUDA error check "<main> kernelA" executed successfully
CUDA error check "<main> kernelA" executed successfully after sync
CUDA error check "<main> cudaFree" executed successfully
CUDA error check "<main> cudaDeviceReset" executed successfully

With the errors corrected (a device ID of 0, and removing "* 1000000000"), everything reports running successfully. Printing on success is usually not done, though.

Conclusion

In this post, we covered the use of error checking functions and the various approaches needed for debugging synchronous and asynchronous GPU operations. Throughout the post, other common debugging methods were also mentioned that may prove useful when writing GPU code.

At this point, you should be able to write and debug simple GPU programs if you have also gone through [post 1] and [post 2] previously.

Additional Information

If you were curious about the naming convention: I have found that starting folders or files with numbers, like 1_creatingErrors.cu, makes for easy use of tab autocompletion in terminal windows. It is only intended for initial work with code or samples, though.

This post became a bit longer than previous ones, so I am including my simple makefile. A future post may take a closer look at makefiles, but in the meantime there are many pages on makefiles across the internet.

NVCC = nvcc
CUDAFLAGS = -arch=sm_61 -lcudart
OPT = -m64

all: zero one two

errorChecking.obj: errorChecking.cu errorChecking.cuh
	${NVCC} ${CUDAFLAGS} ${OPT} -c errorChecking.cu -o errorChecking.obj

errorCheckingMacro.obj: errorCheckingMacro.cu errorCheckingMacro.cuh
	${NVCC} ${CUDAFLAGS} ${OPT} -c errorCheckingMacro.cu -o errorCheckingMacro.obj

zero: 0_creatingErrors.cu
	${NVCC} ${CUDAFLAGS} ${OPT} -o zero 0_creatingErrors.cu

one: 1_creatingErrors.cu errorChecking.obj
	${NVCC} ${CUDAFLAGS} ${OPT} -o one 1_creatingErrors.cu errorChecking.obj

two: 2_creatingErrorsMacro.cu errorCheckingMacro.obj
	${NVCC} ${CUDAFLAGS} ${OPT} -o two 2_creatingErrorsMacro.cu errorCheckingMacro.obj
	
clean:
	${RM} *.o *.obj *.exp *.pdb *.exe zero one two

Contact me if you would like to use the contents of this post. Thanks 🙂
Copyright © 2019 by Gregory Gutmann
