Launching CUDA Functions: CUDA Introduction Part 1

Author: Greg Gutmann
Affiliation: Tokyo Institute of Technology, Nvidia University Ambassador, Nvidia DLI 

Prerequisites: some C/C++ familiarity and technical background, or good Googling skills.

Introduction

Many have already discussed the benefits of using a GPU vs. a CPU. Here is a post from Medium that discusses GPUs in a clear and simple manner; it also links to a nice video by the MythBusters, Adam Savage and Jamie Hyneman.

Post: https://medium.com/altumea/gpu-vs-cpu-computing-what-to-choose-a9788a2370c4
Video: https://www.youtube.com/watch?v=-P28LKWTzrI

I may discuss GPU vs. CPU from a low-level perspective in the future, which has not been covered as much, but that is not needed for the purpose of this post. This post aims to get you coding right away.

I wrote the samples for this post on Windows, but they should run on Linux with little or no change. I will save the comparison of running CUDA code on Windows vs. Linux for a future post as well.

Using NVCC: Nvidia’s CUDA Compiler

NVCC is a C/C++ compiler with regard to the host code. Therefore, programming in either language’s style, or a mix of the two, is accepted by the compiler. For example, the following two code samples can both be compiled with NVCC.

// Programming style: C
// File name: CUDA_c.cu
// Compiler command: nvcc CUDA_c.cu
#include <stdio.h>

int main()
{
   // printf() displays the string inside the quotation marks
   printf("Hello, World!\n");
   return 0;
}

// Programming style: C++
// File name: CUDA_cpp.cu
// Compiler command: nvcc CUDA_cpp.cu
#include <iostream>

int main() 
{
   // std::cout displays the string inside the quotation marks
   std::cout << "Hello, World!\n";
   return 0;
}

Compiling either of these produces an executable named a.exe by default.
Alternatively, the following compile command with -o can be used to name the produced executable.

nvcc CUDA_c.cu -o myApp

myApp.exe will be the result on Windows.

Writing CUDA Code that Runs on the GPU

Let's start by defining some terms, then move on to the code:

  • device = GPU
  • host = CPU
  • kernel = a GPU function that is called from the CPU

#include <stdio.h>

// __global__ keyword specifies a device kernel function
__global__ void kernelA(){
	printf("Hello, from the GPU!\n");
}

int main()
{
	printf("Hello, from the CPU!\n");
	
	// Set which device should be used
	// The runtime defaults to device 0 if this is not called
	cudaSetDevice(0);

	// Call a device function from the host: a kernel launch
	// Which will print from the device
	kernelA <<<1,1>>>();

	// This call waits for all of the submitted GPU work to complete
	cudaDeviceSynchronize();

	return 0;
}

Sample’s output

Hello, from the CPU!
Hello, from the GPU!

The sample here shows everything that is needed to run code on a GPU, and a few things that are recommended. You may have noticed the code sample only prints out lines of text though.

The required parts are:

  • Using the __global__ keyword for the functions that will be called from the host and run on the device
  • Using the <<< , >>> angle brackets to mark a call from host code to device code

The recommended parts are:

  • Calling cudaSetDevice(int device); to specify which device should be used
  • Calling cudaDeviceSynchronize(); after the kernel call, to ensure the device code completes before the main code returns. Kernel launches are asynchronous, meaning the host does not wait for the kernel to return before continuing on (see the sketch below).
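
The following is a minimal sketch (not part of the original sample) illustrating that asynchrony: the host prints immediately after the launch, so its line can appear before the GPU's output. It also checks the error code returned by cudaDeviceSynchronize(), which is optional but good practice.

#include <stdio.h>

__global__ void kernelA(){
	printf("Hello, from the GPU!\n");
}

int main()
{
	cudaSetDevice(0);

	// The launch returns control to the host immediately
	kernelA<<<1,1>>>();

	// Because the launch is asynchronous, this line may print
	// before the GPU's output appears
	printf("Hello again, from the CPU!\n");

	// cudaDeviceSynchronize() waits for the submitted GPU work to finish
	// and returns an error code that can be checked
	cudaError_t err = cudaDeviceSynchronize();
	if (err != cudaSuccess) {
		printf("CUDA error: %s\n", cudaGetErrorString(err));
	}

	return 0;
}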

A Deeper Look at CUDA Function Keywords

This section will jump to slightly more advanced function usage, but related to the last section. Feel free to skip this section and come back to it as needed.

#include <stdio.h>

// __device__ keyword specifies a function that is run on the device and called from a kernel (1a)
__device__ void GPUFunction(){ 
	printf("\tHello, from the GPU! (1a)\n");
}

// This is a kernel that calls a device function (1b)
__global__ void kernelA(){
	GPUFunction();
}

// __host__ __device__ keywords can be specified if the function needs to be 
//					   available to both the host and device (2a)
__host__ __device__ void versatileFunction(){
	printf("\tHello, from the GPU or CPU! (2a)\n");
}

// This is a kernel that calls a function on the device (2b)
__global__ void kernelB(){
	versatileFunction();
}

int main()
{
	cudaSetDevice(0);

	// Launch a kernel that will print from a function called by device code (1b -> 1a)
	printf("\nLaunching kernel 1b\n");
	kernelA<<<1,1>>>();

	cudaDeviceSynchronize();

	// Call a function from the host (2a)
	printf("\nCalling host function 2a\n");
	versatileFunction();

	// Call the same function from the device (2b -> 2a)
	printf("\nLaunching kernel 2b\n");
	kernelB<<<1,1>>>();

	cudaDeviceSynchronize();

	return 0;
}

Sample’s output (function calls can be followed by name or by their (#) labels)

Launching kernel 1b
        Hello, from the GPU! (1a)

Calling host function 2a
        Hello, from the GPU or CPU! (2a)

Launching kernel 2b
        Hello, from the GPU or CPU! (2a)

The above sample shows the use of the keywords: 

  • __device__ specifies a function that is called from device code and run on the device (i.e., called from inside a kernel)
  • __host__ specifies a function that is called from the host and run on the host

The sample also shows that both keywords can be used on a single function.
An example of why this is beneficial: writing your own math operations that both the host and the device need to make use of.
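
As a minimal sketch of that idea (not from the original post; squareSum is a made-up helper), the following function can be called from both sides:

#include <stdio.h>

// A made-up math helper that both the host and the device can call
__host__ __device__ float squareSum(float a, float b){
	return a * a + b * b;
}

__global__ void kernelC(){
	printf("\tFrom the GPU: %f\n", squareSum(3.0f, 4.0f));
}

int main()
{
	cudaSetDevice(0);

	// The same function runs on the host...
	printf("From the CPU: %f\n", squareSum(3.0f, 4.0f));

	// ...and on the device
	kernelC<<<1,1>>>();

	cudaDeviceSynchronize();

	return 0;
}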

In the sample above cudaDeviceSynchronize(); is called after every kernel launch to make the output easy to read. Typically, cudaDeviceSynchronize(); is used only when it is absolutely necessary, as it stalls code execution, and excessive calls will likely slow down the code.

Launching Functions With More Than One Thread

Until now, all of the GPU functions we have launched have used only one thread. Given that the GPU is often referred to as a massively parallel processing unit, using a single thread is not making the best use of it.

On to the next code sample:

#include <stdio.h>

__global__ void kernelA(){
	printf("Hello, from kernelA\n");
}

__global__ void kernelB(){
	printf("Hello, from kernelB\n");
}

int main()
{
	cudaSetDevice(0);
	
	printf("--------- Example 1 ---------\n");
	
	int blockCount = 1; 
	int threadCount = 4;

	// Calling a kernel with 1 block that contains 4 threads
	// Launching a total of 4 threads
	kernelA<<<blockCount,threadCount>>>();
	
	cudaDeviceSynchronize();

	printf("\n--------- Example 2 ---------\n");

	blockCount = 3; 
	threadCount = 2;

	// Calling a kernel with 3 blocks that each contain 2 threads
	// Launching a total of 6 threads
	kernelB<<<blockCount,threadCount>>>();

	cudaDeviceSynchronize();

	return 0;
}

Sample’s output:

--------- Example 1 ---------
Hello, from kernelA
Hello, from kernelA
Hello, from kernelA
Hello, from kernelA

--------- Example 2 ---------
Hello, from kernelB
Hello, from kernelB
Hello, from kernelB
Hello, from kernelB
Hello, from kernelB
Hello, from kernelB

The main change in this sample vs. the first section is that the <<< , >>> angle brackets have been filled in. The first number specifies the number of blocks launched and the second number specifies the number of threads per block. Therefore, in this example, the number of blocks * the number of threads gives the total thread count of the kernel.

Thread Hierarchy: CUDA Grid, Blocks and Threads

Threads launched by a kernel are organized into blocks, which are in turn organized into a grid. Another way of saying it:

  • A kernel launches a grid
  • A grid contains numerous blocks
  • A block contains numerous threads
  • Each thread runs the code that is contained in the kernel function

[Figure: Thread configuration for the samples above. A graphical representation of the previous sample's kernel launches; blocks and threads are indexed starting at 0, which will become relevant in the next post.]
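
To make the hierarchy concrete, here is a minimal sketch (not from the original post) that prints the launch configuration from inside a kernel using the built-in gridDim and blockDim variables; per-thread indexing is left for the next post.

#include <stdio.h>

// Every thread can read the launch configuration through built-in variables
__global__ void kernelC(){
	printf("Grid has %u block(s); each block has %u thread(s)\n",
	       gridDim.x, blockDim.x);
}

int main()
{
	cudaSetDevice(0);

	// 3 blocks * 2 threads = 6 threads, so this prints 6 identical lines
	kernelC<<<3,2>>>();

	cudaDeviceSynchronize();

	return 0;
}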

A kernel launch must contain at least 1 block and 1 thread for any code to be run.

  • The maximum number of threads per block is 1024
  • The maximum number of blocks using the launch configuration in the above sample is 2^31 − 1 (2,147,483,647)

This results in a maximum kernel thread count of over 2 trillion (1024 × (2^31 − 1) ≈ 2.2 × 10^12 threads). Most likely this is sufficient for any problem; however, if it is not, the parallel launch will need to be broken up into multiple kernel calls, as sketched below.
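
As a rough sketch (not from the original post) of breaking up a launch, the loop below launches a placeholder kernel several times, each time with at most the maximum block count. In practice each launch would also receive an offset parameter so it works on its own portion of the problem; kernel parameters and indexing are covered in the next post.

#include <stdio.h>

// A placeholder kernel standing in for real work
__global__ void workKernel(){
	// Per-thread work would go here
}

int main()
{
	cudaSetDevice(0);

	// Suppose the problem needs more blocks than a single launch allows
	long long totalBlocksNeeded = 5000000000LL;   // > 2^31 - 1
	long long maxBlocksPerLaunch = 2147483647LL;  // 2^31 - 1
	int threadsPerBlock = 1024;

	for (long long start = 0; start < totalBlocksNeeded; start += maxBlocksPerLaunch) {
		long long remaining = totalBlocksNeeded - start;
		int blocks = (int)(remaining < maxBlocksPerLaunch ? remaining : maxBlocksPerLaunch);

		// Launch one piece of the overall work
		workKernel<<<blocks, threadsPerBlock>>>();
	}

	cudaDeviceSynchronize();

	return 0;
}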

For more specific information on device maximums see the following:

        Web page: https://en.wikipedia.org/wiki/CUDA
        Section: Version features and specifications
        Table: Technical specifications

Blocks and threads can be organized into multiple dimensions too; however, I will leave that for the next post, as this one is already sufficiently long. Part 2 here.

Contact me if you would like to use the contents of this post. Thanks 🙂
Copyright © 2018 by Gregory Gutmann
