Thread Indexing and Memory: CUDA Introduction Part 2

Author: Greg Gutmann
Affiliation: Tokyo Institute of Technology, Nvidia University Ambassador, Nvidia DLI 

Thread Indexing

In the part 1 post we ended by looking at launching kernels with multiple threads; however, they all did exactly the same work, which is not very useful on its own. The following code shows how thread and block indices can be used to differentiate the work done by each thread, either by operating on different data or, in some cases, by taking different logic paths (branching, covered in a future post).

#include <stdio.h>
 
__global__ void kernelA(){
	// threadIdx.x: The thread id with respect to the thread's block
	//              From 0 - (thread count per block - 1)
	
	// blockIdx.x:  The block id with respect to the grid (all blocks in the kernel)
	//              From 0 - (number of blocks launched - 1)
	
	// blockDim.x:  The number of threads in a block (block's dimension)
	//              Single value
	
	// globalThreadId: The id of the thread with respect to the whole kernel
	//                 From 0 - (kernel thread count - 1)
	
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    
	printf("My threadIdx.x is %d, blockIdx.x is %d, blockDim.x is %d, Global thread id is %d\n",
			threadIdx.x, blockIdx.x, blockDim.x, globalThreadId);
}
 
int main()
{
    // Set which device should be used
    // The code defaults to device 0 if this is not called
	cudaSetDevice(0);
 
    // Call a device function from the host: a kernel launch
    // Which will print from the device
	kernelA <<<2,2>>>();
 
    // This call waits for all of the submitted GPU work to complete
	cudaDeviceSynchronize();
	
	// Destroys and cleans up all resources associated with the current device. 
	// It will reset the device immediately. It is the caller's responsibility 
	//    to ensure that the device work has completed
	cudaDeviceReset();
	 
	return 0;
}

Sample’s output:

My threadIdx.x is 0, blockIdx.x is 0, blockDim.x is 2, Global thread id is 0
My threadIdx.x is 1, blockIdx.x is 0, blockDim.x is 2, Global thread id is 1
My threadIdx.x is 0, blockIdx.x is 1, blockDim.x is 2, Global thread id is 2
My threadIdx.x is 1, blockIdx.x is 1, blockDim.x is 2, Global thread id is 3

One of the things to point out is the use of the block and thread indices to compute a globalThreadId. This style of computing an index is used for many kernels. For example, a loop on the host with no dependencies between iterations can easily be converted to run on the GPU using this style of indexing.
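
For instance, a minimal sketch of such a conversion might look like the following (not from the original samples; the array name data, the size N, and the kernel name are placeholders):

// Host loop with independent iterations:
//     for (int i = 0; i < N; i++) { data[i] *= 2; }

// One possible GPU equivalent: one thread per loop iteration
__global__ void doubleElements(int * data, int N){
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;

	// Guard against extra threads when N is not a multiple of the block size
	if (globalThreadId < N) {
		data[globalThreadId] *= 2;
	}
}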

At the end of the above sample, cudaDeviceReset() was added. It is good practice to include it, and it helps ensure that profiling completes successfully (profiling is covered in a future post).

Next is a skeleton that shows one way to launch enough threads for a problemSize that may vary, by rounding the block count up with ceil and guarding with a conditional statement in the kernel. This is a potential way to parallelize a CPU loop on the GPU.

#include <stdio.h>
#include <math.h> // for ceil
 
__global__ void kernelA(int N){
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    
	// Conditional statement to exit if index (globalThreadId) is out of bounds
	if(globalThreadId >= N) {
		return;
	}

	//Insert code here
}
 
int main()
{
	// More realistic GPU problem size
	int problemSize = 1000000;

	cudaSetDevice(0);
	
	// 256 is often a reasonable thread count; the best value varies by situation
	int threadCount = 256;
	// Round up so that enough threads are launched to cover problemSize
	//    (this may launch a few more threads than needed, hence the guard in the kernel)
	int blockCount = (int)ceil((float)problemSize / threadCount);
	
	kernelA <<<blockCount, threadCount>>>(problemSize);
 
	cudaDeviceSynchronize();
	
	cudaDeviceReset();
	 
	return 0;
}

Predefined Variables in CUDA Kernels

  • dim3 gridDim: dimensions of the grid
  • dim3 blockDim: dimensions of the block
  • uint3 blockIdx: block index within the grid
  • uint3 threadIdx: thread index within the block
  • int warpSize: number of threads in a warp

Of the variables listed, warpSize has not been mentioned yet. Threads on Nvidia GPUs are organized into warps of 32 threads. In the optimal case, every thread in a warp executes the same operation on different pieces of data; this is referred to as single instruction, multiple thread (SIMT). In the non-ideal case where threads in a warp take different paths, warp divergence occurs and the paths are serialized. This will become more important as we look at optimization in the future.
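
As a small illustrative sketch (not part of the original samples; the kernel name and the values written are arbitrary), the kernel below branches on the thread index, so neighbouring threads in a warp take different paths and the warp serializes them:

__global__ void divergentKernel(int * out){
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;

	// Even and odd threads in the same warp take different branches here,
	// so the warp executes the two paths one after the other (divergence)
	if (threadIdx.x % 2 == 0) {
		out[globalThreadId] = 1;
	} else {
		out[globalThreadId] = -1;
	}
}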

Below is a sample that prints the predefined variables, to give a look at their possible values for one thread. It also shows how both the block and thread dimensions can be three-dimensional.

#include <stdio.h>
 
__global__ void kernelA(){
	// Giant conditional so that only one specific thread prints; this would not be done in practice
	if (blockIdx.x == 0 && blockIdx.y == 1 && blockIdx.z == 0 && threadIdx.x == 1 && threadIdx.y == 0 && threadIdx.z == 1) {
		printf("gridDim   (%d, %d, %d)\n", gridDim.x, gridDim.y, gridDim.z);
		printf("blockDim  (%d, %d, %d)\n", blockDim.x, blockDim.y, blockDim.z);
		printf("blockIdx  (%d, %d, %d)\n", blockIdx.x, blockIdx.y, blockIdx.z);
		printf("threadIdx (%d, %d, %d)\n", threadIdx.x, threadIdx.y, threadIdx.z);
		printf("warpSize  (%d)\n", warpSize);
	}
}
 
int main()
{
	cudaSetDevice(0);
        
    // dim3 is an integer vector type
	dim3 blocks(50, 100, 50);
	dim3 threads(8, 8, 16);
	kernelA <<<blocks,threads>>>();
 
	cudaDeviceSynchronize();
	
	cudaDeviceReset();
	 
	return 0;
}

Sample’s output:

gridDim   (50, 100, 50)
blockDim  (8, 8, 16)
blockIdx  (0, 1, 0)
threadIdx (1, 0, 1)
warpSize  (32)

Multi-dimensional threads and blocks are typically used for problems that naturally map to multiple dimensions, such as processing images with a tiled approach, where each block in the kernel works on a small square region of the image and threadIdx.x and threadIdx.y map to pixels within that tile.

Illustration of the description above: the green block shows possible thread indexing, while blue, black, and red show possible block indexing (following the same pattern as the threads).
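
As a rough sketch of that tiled mapping (assuming a row-major image of width x height with one pixel per thread; the kernel name and the invert operation are only placeholders):

__global__ void processImage(unsigned char * image, int width, int height){
	// Each block covers one tile; each thread covers one pixel of that tile
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	if (x < width && y < height) {
		int pixelIndex = y * width + x;              // row-major layout
		image[pixelIndex] = 255 - image[pixelIndex]; // example operation: invert the pixel
	}
}

A matching launch could use something like dim3 threads(16, 16) with enough blocks in x and y to cover the whole image.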

Explicitly Copying Memory with CUDA

Originally, explicit memory management was the only way to allocate and manage device memory. It requires more work from the developer than managed memory does, but because of the resulting verbosity it is often easier to follow what actions the code is taking, especially when analysing a code's behaviour and performance.

#include <stdio.h>
 
__global__ void kernelA(int * globalArray){
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    
	globalArray[globalThreadId] = globalThreadId;
}
 
int main()
{
	int elementCount = 32;
	int dataSize = elementCount * sizeof(int);
	
	// malloc: allocates memory and returns a pointer to it (C/C++ function)
	int * hostArray = (int*)malloc(dataSize);
	
	cudaSetDevice(0);
	
	// Pointers to device memory are the same as host pointers
	int * deviceArray;
	// cudaMalloc: allocates device memory to the device that has been set
	cudaMalloc(&deviceArray, dataSize);
 
	kernelA <<<4,8>>>(deviceArray);
 
    cudaDeviceSynchronize();
	
	// cudaMemcpy: copies memory between the device and the host 
	cudaMemcpy(hostArray, deviceArray, dataSize, cudaMemcpyDeviceToHost);
	
	// Print out result (used conditional statement to make it look nice)
	for(int i = 0; i < elementCount; i++){
		printf("%d%s", hostArray[i], (i < elementCount - 1) ? ", " : "\n");
	}	
	
	// Free the host memory
	free(hostArray);
	
	// Free the device memory
	cudaFree(deviceArray);
	
	cudaDeviceReset();
 
	return 0;
}

Sample’s output:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

Similar to the standard C/C++ functions for managing host memory, malloc, memcpy, and free, CUDA device memory is managed with cudaMalloc, cudaMemcpy, and cudaFree. There are numerous others, but these three are a good start. (A future post will cover memory in more depth.)

Of the three, cudaMemcpy is a bit different from its host counterpart: it takes a flag that declares where the source and destination memory are located, host or device. The enum is shown below, followed by a short round-trip sketch:

enum cudaMemcpyKind (CUDA memory copy kind)

  • cudaMemcpyHostToHost = 0
    • Host -> Host
  • cudaMemcpyHostToDevice = 1
    • Host -> Device
  • cudaMemcpyDeviceToHost = 2
    • Device -> Host
  • cudaMemcpyDeviceToDevice = 3
    • Device -> Device
  • cudaMemcpyDefault = 4
    • Direction of the transfer is inferred from the pointer values. Requires unified virtual addressing
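
As a minimal round-trip sketch, reusing the names from the explicit memory sample above (which itself only copies results back, since the kernel generates the data), input would be copied to the device before the launch and the result copied back afterwards:

// Host -> Device: copy input data to the GPU before the kernel runs
cudaMemcpy(deviceArray, hostArray, dataSize, cudaMemcpyHostToDevice);

kernelA <<<4,8>>>(deviceArray);
cudaDeviceSynchronize();

// Device -> Host: copy the result back once the kernel has finished
cudaMemcpy(hostArray, deviceArray, dataSize, cudaMemcpyDeviceToHost);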

CUDA API: https://docs.nvidia.com/cuda/cuda-runtime-api

Managed Memory

Managed memory makes writing GPU code easier. However, there are some details that must be considered. 

  • Using managed memory efficiently can, in some cases, require as much effort as explicit memory management
  • Managed memory’s behind-the-scenes behaviour takes some time to understand, but it has benefits
  • With managed memory, oversubscription is possible (you can allocate more memory than the GPU physically has)

Below is a sample using managed memory. It uses the same kernel as the sample above and produces the same result.

#include <stdio.h>
 
__global__ void kernelA(int * globalArray){
	int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    
	globalArray[globalThreadId] = globalThreadId;
}
 
int main()
{
	int elementCount = 32;
	int dataSize = elementCount * sizeof(int);
	
	cudaSetDevice(0);
	
	int * managedArray;
	// cudaMallocManaged: Allocates memory that will be automatically managed 
	//    by the Unified Memory system. Accessible by host and device
	cudaMallocManaged(&managedArray, dataSize);
 
	kernelA <<<4,8>>>(managedArray);
 
    cudaDeviceSynchronize();
	
	for(int i = 0; i < elementCount; i++){
		printf("%d%s", managedArray[i], (i < elementCount - 1) ? ", " : "\n");
	}	
	
	// Free the CUDA allocated memory
	cudaFree(managedArray);
	
	cudaDeviceReset();
 
	return 0;
}

Sample’s output:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

This simple usage of managed memory has made the code shorter, and it is well suited to prototyping GPU algorithms. However, explicit memory management, or more in-depth use of managed memory, is typically needed to obtain the desired performance.

Brief Comparison

To give some weight to the claim above, we will take a brief look at Nvidia's nvprof command-line profiler.

[Screenshot: nvprof profiler results for the two samples]

The focus is on the GPU activities (the yellow text in the profiler output). The problem size profiled here (32 threads) is far smaller than anything that would normally be run on a GPU.

The profiler result for the explicit memory sample is shown first. The reported kernel time is 2.17 us (microseconds) and the memory copy time is 1.22 us. The other times will be looked at more closely in the future.

In contrast, the managed memory example only shows a kernel time, which was 233.51 us, roughly 68x slower than the kernel plus memory copy above (the difference is exaggerated by how small the kernel is). This time includes the page faults and the memory migration needed to resolve them.

There are ways to give hints to the system to reduce or eliminate this performance loss. They will be covered in a future post, as they are a fairly substantial topic on their own. (Search for cudaMemPrefetchAsync if you cannot wait.)
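
For those who cannot wait, a minimal sketch of how prefetching might be added to the managed memory sample above (assuming device 0 and reusing the same variable names) looks roughly like this:

// Ask the Unified Memory system to move the data to device 0
// before the kernel runs, so the kernel does not page fault
cudaMemPrefetchAsync(managedArray, dataSize, 0);

kernelA <<<4,8>>>(managedArray);

// Prefetch back to the host (cudaCpuDeviceId) before reading on the CPU
cudaMemPrefetchAsync(managedArray, dataSize, cudaCpuDeviceId);
cudaDeviceSynchronize();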

Conclusion

We covered thread indexing and basic memory usage in CUDA in this post. At this point you should have enough to start making simple functional GPU kernels. However, this assumes you do not run into any errors. In the next post we will cover debugging before going into more advanced topics. [Post 3]

Contact me if you would like to use the contents of this post. Thanks 🙂
Copyright © 2018 by Gregory Gutmann
