Author: Greg Gutmann
Affiliation: Tokyo Institute of Technology, Nvidia University Ambassador, Nvidia DLI
Thread Indexing
The part 1 post ended with launching kernels that used multiple threads; however, every thread did exactly the same work, which is not very useful. The following code shows how thread and block indices can be used to differentiate the work done by each thread, either by operating on different data or, in some cases, by taking different logic paths (branching, a future post).
#include <stdio.h>

__global__ void kernelA(){
    // threadIdx.x: The thread id with respect to the thread's block
    //              From 0 - (thread count per block - 1)
    // blockIdx.x: The block id with respect to the grid (all blocks in the kernel)
    //             From 0 - (number of blocks launched - 1)
    // blockDim.x: The number of threads in a block (block's dimension)
    //             Single value
    // globalThreadId: The id of the thread with respect to the whole kernel
    //                 From 0 - (kernel thread count - 1)
    int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;

    printf("My threadIdx.x is %d, blockIdx.x is %d, blockDim.x is %d, Global thread id is %d\n",
           threadIdx.x, blockIdx.x, blockDim.x, globalThreadId);
}
int main()
{
    // Set which device should be used
    // The code will default to 0 if not called though
    cudaSetDevice(0);

    // Call a device function from the host: a kernel launch
    // Which will print from the device
    kernelA <<<2,2>>>();

    // This call waits for all of the submitted GPU work to complete
    cudaDeviceSynchronize();

    // Destroys and cleans up all resources associated with the current device.
    // It will reset the device immediately. It is the caller's responsibility
    // to ensure that the device work has completed
    cudaDeviceReset();

    return 0;
}
Sample’s output:
My threadIdx.x is 0, blockIdx.x is 0, blockDim.x is 2, Global thread id is 0
My threadIdx.x is 1, blockIdx.x is 0, blockDim.x is 2, Global thread id is 1
My threadIdx.x is 0, blockIdx.x is 1, blockDim.x is 2, Global thread id is 2
My threadIdx.x is 1, blockIdx.x is 1, blockDim.x is 2, Global thread id is 3
One thing to point out is the use of the block and thread indices to compute globalThreadId. This style of index computation is used in many kernels; for example, a host loop with no dependencies between iterations can easily be converted to run on the GPU using this style of indexing.
At the end of the sample above, cudaDeviceReset(); was added. Including it is good practice, and it helps ensure that profiling completes successfully (a future post).
Next is skeleton code that shows one way to launch enough threads for a problemSize that can vary, by using ceil on the host and a bounds-check conditional in the kernel. It is one potential way to parallelize a CPU loop on the GPU; a filled-in sketch follows the code.
#include <stdio.h>
#include <math.h>

__global__ void kernelA(int N){
    int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;

    // Conditional statement to exit if the index (globalThreadId) is out of bounds
    if(globalThreadId >= N) {
        return;
    }

    // Insert code here
}
int main()
{
    // More realistic GPU problem size
    int problemSize = 1000000;

    cudaSetDevice(0);

    // On average 256 is a good thread count; the best value varies with the situation
    int threadCount = 256;

    // Simple way to ensure enough threads are launched
    // (it may launch more threads than needed; the extras exit at the bounds check)
    // Note: the division must be done in floating point, otherwise integer division
    // truncates before ceil is applied and too few blocks are launched
    int blockCount = (int)ceil((double)problemSize / threadCount);

    kernelA <<<blockCount, threadCount>>>(problemSize);
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
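To make the skeleton concrete, here is a hedged sketch of how a simple host loop maps onto this pattern. The kernel name scaleKernel, the array data and the factor scaleFactor are placeholders for illustration, and deviceData would be a device pointer allocated with the memory functions covered later in this post.

// Host version: a loop with no dependencies between iterations
// for(int i = 0; i < problemSize; i++){
//     data[i] = data[i] * scaleFactor;
// }

// GPU version: each thread handles one value of i
__global__ void scaleKernel(float * data, float scaleFactor, int N){
    int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    if(globalThreadId >= N) {
        return;
    }
    data[globalThreadId] = data[globalThreadId] * scaleFactor;
}

// Launched the same way as the skeleton above, for example:
// scaleKernel <<<blockCount, threadCount>>>(deviceData, 2.0f, problemSize);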
Predefined variables in CUDA Kernels
dim3 gridDim | dimensions of grid
dim3 blockDim | dimensions of block
uint3 blockIdx | block index within grid
uint3 threadIdx | thread index within block
int warpSize | number of threads in warp
Of the variables listed, warpSize has not been mentioned yet. The threads on Nvidia's GPUs are organized into warps of 32 threads. In the optimal situation, every thread in a warp executes the same instruction on different data; this is referred to as single instruction, multiple thread (SIMT). In the non-ideal situation, where not all threads of a warp take the same path, warp divergence occurs and the different paths are executed serially. This will become more important when we look at optimization in the future.
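As a small illustration of divergence (a hedged sketch that is not part of the post's samples, assuming a one-dimensional block whose size is a multiple of 32): in the first kernel below, even and odd threads within the same warp take different branches, so the warp executes both branches one after the other; in the second, the condition is uniform within each warp, so no divergence occurs.

// Divergent: threads of the same warp take different paths,
// so the warp executes both branches serially
__global__ void divergentKernel(int * out){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(threadIdx.x % 2 == 0) {
        out[i] = 1;
    } else {
        out[i] = 2;
    }
}

// Uniform per warp: all 32 threads of a warp evaluate the condition the same way,
// so each warp executes only one of the two branches
__global__ void uniformKernel(int * out){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if((threadIdx.x / warpSize) % 2 == 0) {
        out[i] = 1;
    } else {
        out[i] = 2;
    }
}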
Below is a sample code that prints the predefined variables for one thread, to show their possible values. It also shows how both the block and thread dimensions can be three dimensional.
#include <stdio.h>

__global__ void kernelA(){
    // Giant conditional so that it only prints once; this would not be done in practice
    if (blockIdx.x == 0 && blockIdx.y == 1 && blockIdx.z == 0 &&
        threadIdx.x == 1 && threadIdx.y == 0 && threadIdx.z == 1) {
        printf("gridDim (%d, %d, %d)\n", gridDim.x, gridDim.y, gridDim.z);
        printf("blockDim (%d, %d, %d)\n", blockDim.x, blockDim.y, blockDim.z);
        printf("blockIdx (%d, %d, %d)\n", blockIdx.x, blockIdx.y, blockIdx.z);
        printf("threadIdx (%d, %d, %d)\n", threadIdx.x, threadIdx.y, threadIdx.z);
        printf("warpSize (%d)\n", warpSize);
    }
}
int main()
{
    cudaSetDevice(0);

    // dim3 is an integer vector type
    dim3 blocks(50, 100, 50);
    dim3 threads(8, 8, 16);
    kernelA <<<blocks,threads>>>();

    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Sample’s output:
gridDim (50, 100, 50)
blockDim (8, 8, 16)
blockIdx (0, 1, 0)
threadIdx (1, 0, 1)
warpSize (32)
Multi-dimensional threads and blocks are typically used for problems that naturally map to multiple dimensions, for example processing an image with a tiled approach, where each block in the kernel is given a small square region of the image to work on and threadIdx.x and threadIdx.y map to a pixel within that tile.
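Below is a minimal sketch of that idea (the image pointer, imageWidth and imageHeight are placeholder names, and the pixel operation is just for illustration):

// Each thread computes the coordinates of one pixel from its block and thread indices
__global__ void imageKernel(unsigned char * image, int imageWidth, int imageHeight){
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Exit if the thread falls outside the image
    if(x >= imageWidth || y >= imageHeight) {
        return;
    }

    int pixelIndex = y * imageWidth + x;
    image[pixelIndex] = 255 - image[pixelIndex]; // for example, invert the pixel
}

// A possible launch: 16x16 tiles covering the whole image
// dim3 threads(16, 16);
// dim3 blocks((imageWidth + 15) / 16, (imageHeight + 15) / 16);
// imageKernel <<<blocks, threads>>>(deviceImage, imageWidth, imageHeight);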

Explicitly Copying Memory with CUDA
Originally, explicit memory management was the only way to allocate and manage device memory. It can require more work from the developer than managed memory, but the resulting verbosity often makes it easier to follow what actions the code is taking, especially when analysing the behaviour and performance of a particular piece of code.
#include <stdio.h>

__global__ void kernelA(int * globalArray){
    int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    globalArray[globalThreadId] = globalThreadId;
}

int main()
{
    int elementCount = 32;
    int dataSize = elementCount * sizeof(int);

    // malloc: allocates memory and returns a pointer to it (C/C++ function)
    int * hostArray = (int*)malloc(dataSize);

    cudaSetDevice(0);

    // Pointers to device memory look the same as host pointers
    int * deviceArray;
    // cudaMalloc: allocates device memory on the device that has been set
    cudaMalloc(&deviceArray, dataSize);

    // 4 blocks x 8 threads = 32 threads, exactly one per array element
    kernelA <<<4,8>>>(deviceArray);
    cudaDeviceSynchronize();

    // cudaMemcpy: copies memory between the device and the host
    cudaMemcpy(hostArray, deviceArray, dataSize, cudaMemcpyDeviceToHost);

    // Print out the result (the conditional statement just makes the output look nice)
    for(int i = 0; i < elementCount; i++){
        printf("%d%s", hostArray[i], (i < elementCount - 1) ? ", " : "\n");
    }

    // Free the host memory
    free(hostArray);
    // Free the device memory
    cudaFree(deviceArray);

    cudaDeviceReset();
    return 0;
}
Sample’s output:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
Similar to the standard C/C++ functions for managing host memory (malloc, memcpy and free), CUDA device memory is managed with cudaMalloc, cudaMemcpy and cudaFree. There are numerous others too, but these three are a good start. (A future post will cover memory in more detail.)
Of the three, cudaMemcpy is a bit different from its host counterpart: it takes an extra argument that declares where the source and destination memory are located, host or device. The enum is shown below:
enum cudaMemcpyKind (CUDA memory copy kind)
- cudaMemcpyHostToHost = 0
- Host -> Host
- cudaMemcpyHostToDevice = 1
- Host -> Device
- cudaMemcpyDeviceToHost = 2
- Device -> Host
- cudaMemcpyDeviceToDevice = 3
- Device -> Device
- cudaMemcpyDefault = 4
- Direction of the transfer is inferred from the pointer values. Requires unified virtual addressing
CUDA API: https://docs.nvidia.com/cuda/cuda-runtime-api
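As a usage example (a hedged sketch reusing kernelA and dataSize from the explicit memory sample above, with error checking omitted), input data would be copied to the device before the kernel launch and the results copied back afterwards:

int * hostData = (int*)malloc(dataSize);
// ... fill hostData on the host ...

int * deviceData;
cudaMalloc(&deviceData, dataSize);

// Host -> Device: copy the input data to the device before the kernel runs
cudaMemcpy(deviceData, hostData, dataSize, cudaMemcpyHostToDevice);

kernelA <<<4,8>>>(deviceData);
cudaDeviceSynchronize();

// Device -> Host: copy the results back so the host can read them
cudaMemcpy(hostData, deviceData, dataSize, cudaMemcpyDeviceToHost);

free(hostData);
cudaFree(deviceData);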
Managed Memory
Managed memory makes writing GPU code easier. However, there are some details that must be considered.
- Using managed memory efficiently can require as much effort as explicit memory management in some cases
- Managed memory's behind-the-scenes behaviour takes some time to understand, but it has benefits
- With managed memory, oversubscription of GPU memory is possible (you can allocate more device memory than the GPU physically has)
Below is a sample using managed memory. It uses the same kernel as the sample above and produces the same result.
#include <stdio.h>

__global__ void kernelA(int * globalArray){
    int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    globalArray[globalThreadId] = globalThreadId;
}

int main()
{
    int elementCount = 32;
    int dataSize = elementCount * sizeof(int);

    cudaSetDevice(0);

    int * managedArray;
    // cudaMallocManaged: Allocates memory that will be automatically managed
    // by the Unified Memory system. Accessible by host and device
    cudaMallocManaged(&managedArray, dataSize);

    kernelA <<<4,8>>>(managedArray);
    cudaDeviceSynchronize();

    for(int i = 0; i < elementCount; i++){
        printf("%d%s", managedArray[i], (i < elementCount - 1) ? ", " : "\n");
    }

    // Free the CUDA allocated memory
    cudaFree(managedArray);

    cudaDeviceReset();
    return 0;
}
Sample’s output:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
This simple usage of managed memory makes the code shorter, and it is well suited to prototyping GPU algorithms. However, explicit memory management, or more in-depth use of managed memory, is typically needed to obtain the desired performance.
Brief Comparison
To give some weight to that claim, we will take a brief look at output from Nvidia's nvprof command-line profiler.

The focus is the "GPU activities" section of the profiler output (highlighted in yellow in the screenshots). The problem size profiled here (32 elements) is far smaller than anything that would normally be run on a GPU.
The profiler result for the explicit memory sample is shown first. The reported kernel time is 2.17 us (microseconds) and the memory copy time is 1.22 us. The other times will be looked at more closely in the future.
In contrast, the managed memory sample only shows a kernel time, which took 233.51 us, roughly 68x slower than the kernel plus memory copy of the explicit version (the difference is exaggerated by how small the kernel is). This time includes the page faults and the memory movement needed to resolve them.
There are ways to give the system hints that reduce or eliminate this performance loss. This will be covered in a future post, as it is a fairly substantial topic on its own. (Search for cudaMemPrefetchAsync if you cannot wait; a small preview is sketched below.)
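As a brief preview (a hedged sketch based on the managed memory sample above, assuming device 0 is being used), the managed allocation can be prefetched to the GPU before the kernel launch and back to the host before the CPU reads it, so the data movement no longer has to be resolved through page faults:

// Prefetch the managed allocation to device 0 before the kernel touches it
cudaMemPrefetchAsync(managedArray, dataSize, 0);

kernelA <<<4,8>>>(managedArray);

// Prefetch the data back to the host before the CPU reads it
cudaMemPrefetchAsync(managedArray, dataSize, cudaCpuDeviceId);
cudaDeviceSynchronize();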
Conclusion
We covered thread indexing and basic memory usage in CUDA in this post. At this point you should have enough to start making simple functional GPU kernels. However, this assumes you do not run into any errors. In the next post we will cover debugging before going into more advanced topics. [Post 3]
Contact me if you would like to use the contents of this post. Thanks 🙂
Copyright © 2018 by Gregory Gutmann