My Own cuRAND Usage
Through a little experimentation with cuRAND, I have come up with my own way of generating random numbers. To my knowledge, this approach produces equally random values; please comment if you know of any issues.
My approach consists of a single step: initializing a temporary curandState and then generating random values within the same function call or kernel.
The main distinction between the previous approach and this one is that this approach does not use the sequence parameter, since using it significantly slows down the curand_init call. Instead, the curand_init seed is the sum of:
- The seed that is passed into the kernel (seed) +
- The global thread ID (tid) +
- The number of times a random number has been generated by the calling thread (threadCallCount)
If threadCallCount is not incremented on each call to my getRandom function (see the code below), the thread will produce the same random value every time. Incrementing threadCallCount on each call to getRandom gives curand_init a different seed each time, so successive calls from the same thread return different values.
Output of Code Below
Timing is in milliseconds
OneStepRandom
0.367, 0.202, 0.202, 0.724, 0.724, 0.838, 0.838, 0.646,
Elapsed time generate 0.021
Shannon Entropy <13.258>
As seen by comparing the timings of each approach, the approach outlined here performs significantly better because it does not need to read from global memory. It also has the benefit of not needing to allocate an array of curandState and initialize it before generating random numbers in a kernel call. The speedup over the previous sample is 6.6x; however, the impact of using this approach will vary based on your specific code.
Code
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand.h>
#include <curand_kernel.h>
#include <limits.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <map>

void ShannonEntropy(int* data, int N, int& min, int& max, float& entropy);

// Initialize a temporary state and draw one value from it.
// The seed differs per call; sequence and offset are left at 0.
// Note: the terms are simply added, so (tid, 1) and (tid + 1, 0) share a seed.
__device__ float getRandom(uint64_t seed, int tid, int threadCallCount)
{
    curandState s;
    curand_init(seed + tid + threadCallCount, 0, 0, &s);
    return curand_uniform(&s);
}

// Each thread writes two random values.
__global__ void one_step_kernel(uint64_t seed, float* randoms)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    randoms[tid * 2 + 0] = getRandom(seed, tid, 0);
    randoms[tid * 2 + 1] = getRandom(seed, tid, 1);
}

int main()
{
    printf("\nOneStepRandom\n");

    int threads = 256;
    int blocks = 5120;
    int N = blocks * threads * 2;

    float* randomValues;
    float* host_randomValues;
    int* host_int;
    float time_elapsed;
    cudaEvent_t startTime;
    cudaEvent_t stopTime;
    cudaStream_t computeStream;

    host_randomValues = (float*)malloc(N * sizeof(float));
    host_int = (int*)malloc(N * sizeof(int));

    // Init device memory
    cudaMalloc(&randomValues, N * sizeof(float));
    cudaEventCreate(&startTime);
    cudaEventCreate(&stopTime);
    cudaStreamCreateWithFlags(&computeStream, cudaStreamNonBlocking);

    // ----- Generate random numbers -----
    cudaEventRecord(startTime, computeStream);
    one_step_kernel<<<blocks, threads, 0, computeStream>>>((uint64_t)time(NULL), randomValues);
    cudaEventRecord(stopTime, computeStream);
    cudaEventSynchronize(stopTime);
    cudaEventElapsedTime(&time_elapsed, startTime, stopTime);

    // ----- Concluding Steps -----
    cudaMemcpy(host_randomValues, randomValues, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Convert floats to ints for the Shannon entropy function
    for (int i = 0; i < N; ++i) {
        // Print a few values out
        if (i < 8) {
            printf("%.3f, ", host_randomValues[i]);
        }
        host_int[i] = (int)(host_randomValues[i] * 10000.0f);
    }
    printf("\n");
    printf("Elapsed time generate %9.3f\n", time_elapsed);

    int min, max;
    float entropy;
    ShannonEntropy(host_int, N, min, max, entropy);
    printf("Shannon Entropy <%6.3f>\n", entropy);

    // Release device and host resources
    cudaEventDestroy(startTime);
    cudaEventDestroy(stopTime);
    cudaStreamDestroy(computeStream);
    cudaFree(randomValues);
    free(host_randomValues);
    free(host_int);
    return 0;
}

// Histogram the values, then sum -p * log2(p) over the occupied bins.
void ShannonEntropy(int* data, int N, int& min, int& max, float& entropy)
{
    entropy = 0;
    // Init
    min = INT_MAX;
    max = INT_MIN;

    std::map<int, long> counts;
    std::map<int, long>::iterator it;

    for (int dataIndex = 0; dataIndex < N; dataIndex++) {
        int dValue = data[dataIndex];
        if (dValue < min) { min = dValue; }
        if (dValue > max) { max = dValue; }
        counts[dValue]++;
    }

    it = counts.begin();
    while (it != counts.end()) {
        float p_x = (float)it->second / N;
        if (p_x > 0) entropy -= (float)(p_x * log(p_x) / log(2.0));
        it++;
    }
}
Contact me if you would like to use the contents of this post. Thanks.
Copyright © 2020 by Gregory Gutmann