Special syntaxes of CUDA C++ or C++
(void**)&ptr: call out the address of the pointer.(void *)kernelFunc: similar to (void *)&kernelFunc, casting a C func to a generic pointer that is required individually by CUDA 12+. Why generic? because kernelFunc is a pointer already.- What
**?:- int a = 10; int* p1 = &a; int** p2 = &p1; (so p2 is a ptr to ptr p1 that points to a). So, void** means kernel wants to modify a ptr.
- CUDA prototype explaination:
cudaError_t cudaMalloc(void **devPtr, size_t size)- What
void*?: it is pointer to any type in C++, just generic/general pointer with no specific type or any type is OK. - So,
void* *devPtris just ptr to ptr, meaning we just want to change the pointer of “devPtr”, not the value/address it contains.- We can call:
cudaMalloc(&devPtr, size), now we exactly have address of devPtr, not the value/address it contains, we can modify/add memory for it.
- We can call:
- If like this
cudaError_t cudaMalloc(void *devPtr, size_t size)- Then, we can:
cudaMalloc(devPtr, size), so CUDA only take the value that devPtr brings, not its pointer, so CUDA cannot allocate memory for devPtr.
- Then, we can:
- NOTE: cudaMalloc() will allocate a space of device memory for a host pointer.
- What
- Same for :
cudaError_t cudaGetSymbolAddress(void **devPtr, const void *symbol);
float* ptrvsfloat* ptr[2]vsfloat** ptrvsfloat** ptr[2]:float* ptr: ptr is a pointer that points to a single float or array of floats.float* ptr[2]: ptr is not a pointer, it is an array of 2 pointers containing float.float** ptr: ptr is a pointer that either points another pointer or containing address of another pointer (float** ptr = &another_ptr;)float** ptr[2]: an arr of two pointers, each points to another pointer.
-
ptrs + n: in C++ or Cuda, we always meet the weird calculation: a array pointer can be added an integer. Well, it’s just a short way to call the address ofptrs[n]or exactly&ptrs[n]. - Clamping VS “modulo_wrapping”: both of them define a way to distribute data into different bins/groups. For instance, we have nbins = 4 (we divide in to 4 groups: 0,1,2,3):
-
data -2 -1 0 1 2 3 4 5 clamping_binid/group 0 0 0 1 2 3 3 3 modulo_binid/group 2 3 0 1 2 3 0 1 - Modulo (%: phần dư của phép chia hai số nguyên) :
- Best suitable for circular/repeating data.
- Outliers will be distributed evenly around all bins.
-
int binid = value % nbins; if (binid < 0) binid += nbins; // handle negatives safely
- Clamping:
- Good for bounded data, like image pixels, limited sensor data range.
- Outliers will be accumulated on edge/border.
-
int binid = ldata; if (ldata < 0) binid = 0; else if (ldata >= nbins) binid = nbins - 1;
-
Atomicoperations: refers to an operation that is only executed by one thread until it finishes read-modify-write process safely and fully. For example: int x=0; x = x+1;. If (x=x+1) is not atomic, thread A and B might do the increment at the same time but x might be only 2, not 3 after 2 times of incrementing. So, we need set (x=x+1) atomic.- In plain C++ : std::atomic
counter{0}; - In Cuda:
__managed__cuda::atomic<int, cuda::thread_scope_system> counter(0); (__managed__refers to"unified memory"that is accessable to both host and device).
- In plain C++ : std::atomic