using BenchmarkTools
using CUDA
CUDA.jl is the most mature of the Julia GPU backends and is the one we will use in this workshop. AMDGPU.jl is somewhat behind but still ready for general use, while oneAPI.jl and Metal.jl are functional but may contain bugs, lack some features, and provide suboptimal performance.
What is the difference between a CPU and a GPU?
![CPU vs GPU architecture](./Graphics/cpu_vs_gpu.png)
![GPU meme](./Graphics/meme_gpu.jpg)
Some key aspects of GPUs that need to be kept in mind:
- The large number of compute elements on a GPU (in the thousands) can enable extreme scaling for data parallel tasks.
- GPUs have their own memory. This means that data needs to be transferred to and from the GPU during the execution of a program.
- Cores in a GPU are arranged into a particular structure. At the highest level they are divided into "streaming multiprocessors" (SMs). Some of these details are important when writing your own GPU kernels.
![Schematic layout of a GPU with its streaming multiprocessors](./Graphics/gpu.png)
- host: CPU + system memory (host memory)
- device: GPU with its memory (device memory)
- SM: Streaming Multiprocessor
Communication:
- Host-device bandwidth: 31.5 GB/s
- GPU global memory bandwidth: 1555 GB/s
GPU programming with Julia can be as simple as using a different array type instead of regular `Base.Array` arrays:
- `CuArray` from CUDA.jl for NVIDIA GPUs
- `ROCArray` from AMDGPU.jl for AMD GPUs
- `oneArray` from oneAPI.jl for Intel GPUs
- `MtlArray` from Metal.jl for Apple GPUs
These array types are subtypes of `AbstractGPUArray` from GPUArrays.jl and closely resemble `Base.Array`, which enables us to write generic code that works on both CPU and GPU arrays.
if CUDA.functional()
    A_d = CuArray([1, 2, 3, 4])
    A_d .+= 1
end
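Because the code is generic, the same function can run on either device. Here is a minimal sketch (the function name `scale_shift` is purely illustrative):

scale_shift(A, a, b) = a .* A .+ b

scale_shift(ones(4), 2, 1)           # runs on the CPU with Base.Array
if CUDA.functional()
    scale_shift(CUDA.ones(4), 2, 1)  # the same code runs on the GPU with CuArray
end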
We can do the same operation with other subtypes of `AbstractGPUArray`:
# requires: using AMDGPU
if AMDGPU.functional()
    A_d = ROCArray([1, 2, 3, 4])
    A_d .+= 1
end

# requires: using oneAPI
if oneAPI.functional()
    A_d = oneArray([1, 2, 3, 4])
    A_d .+= 1
end

# requires: using Metal
if Metal.functional()
    A_d = MtlArray([1, 2, 3, 4])
    A_d .+= 1
end
Moving an array back from the GPU to the CPU is simple:
if CUDA.functional()
    A = Array(A_d)
end
However, the overhead of copying data to the GPU makes such simple calculations very slow.
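To see this concretely, here is a minimal sketch (the array size is arbitrary) that times the transfer and the computation separately; note that the first execution of each operation also includes compilation time:

if CUDA.functional()
    X = rand(Float32, 2^20)
    CUDA.@time X_d = CuArray(X)   # dominated by the host-to-device copy
    CUDA.@time X_d .+= 1          # the computation itself is tiny in comparison
end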
Let’s have a look at a more realistic example: matrix multiplication. We create two random arrays, one on the CPU and one on the GPU, and compare the performance:
if CUDA.functional()
    A = rand(2^12, 2^12)
    A_d = CuArray(A)
    @btime $A * $A
    CUDA.@time A_d * A_d
end

if CUDA.functional()
    A = rand(Float32, 2^12, 2^12)
    A_d = CuArray(A)
    @btime $A * $A
    CUDA.@time A_d * A_d
end
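Note that GPU operations execute asynchronously: to benchmark them fairly with BenchmarkTools, wrap the expression in `CUDA.@sync` so the timing waits for the computation to finish. A small sketch:

if CUDA.functional()
    @btime CUDA.@sync $A_d * $A_d
end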
GPUs normally perform significantly better with 32-bit floats, and some GPUs don't support 64-bit floats at all!
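If your data starts out in double precision, you can convert it during the upload. A small sketch (the matrix size is arbitrary):

if CUDA.functional()
    A = rand(2^10, 2^10)          # Float64 on the CPU
    A_d = CuArray{Float32}(A)     # converted to Float32 during the copy
    eltype(A_d)                   # Float32
end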
Many array operations in Julia are implemented using loops, processing one element at a time. Doing so with GPU arrays is very inefficient, as the loop won't actually execute on the GPU but will instead transfer one element at a time and process it on the CPU. As this wrecks performance, you will be warned when performing this kind of iteration:
if CUDA.functional()
    A_d[1] = 3.0
end
Scalar indexing is only allowed in an interactive session, e.g. the REPL, because it is convenient when porting CPU code to the GPU. If you want to disallow scalar indexing, e.g. to verify that your application executes correctly on the GPU, call the `allowscalar` function:
if CUDA.functional()
    CUDA.allowscalar(false)
    A_d[1] = 3.0   # this will now throw an error
end
In a non-interactive session, e.g. when running code from a script or application, scalar indexing is disallowed by default. There is no global toggle to allow scalar indexing; if you really need it, you can mark expressions using `allowscalar` with do-block syntax or the `@allowscalar` macro:
if CUDA.functional()
    CUDA.allowscalar(false)
    CUDA.allowscalar() do
        A_d[1] += 1
    end
    CUDA.@allowscalar A_d[1] += 1
end
NVIDIA provides the CUDA toolkit, a collection of libraries that contain precompiled kernels for common operations like matrix multiplication (cuBLAS), fast Fourier transforms (cuFFT), linear solvers (cuSOLVER), sparse linear algebra (cuSPARSE), etc. These kernels are wrapped in CUDA.jl and can be used directly with CuArrays.
The recommended way to use CUDA.jl is to let it automatically download an appropriate CUDA toolkit. CUDA.jl will check your driver's capabilities, which versions of CUDA are available for your platform, and automatically download an appropriate artifact containing all the libraries that CUDA.jl supports.
You can also pin CUDA.jl to a specific toolkit version:

CUDA.set_runtime_version!(v"11.8")
To use a local CUDA installation instead, invoke the same API with `local_toolkit=true`:

CUDA.set_runtime_version!(local_toolkit=true)
if CUDA.functional()
    CUDA.versioninfo()
end
Let's do a guided tour of what is inside CUDA.jl!
if CUDA.functional()
    using CUDA.CUBLAS
    using CUDA.CUFFT
    using CUDA.CUSOLVER
    using CUDA.CUSPARSE
end
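Much of this functionality is available transparently through standard Julia functions. As a small sketch (the names `M_d` and `b_d` and the sizes are arbitrary), the following operations dispatch to cuBLAS and cuSOLVER respectively:

if CUDA.functional()
    M_d = CUDA.rand(512, 512)
    b_d = CUDA.rand(512)
    M_d * M_d    # matrix multiplication runs on cuBLAS
    M_d \ b_d    # dense linear solve runs on cuSOLVER
end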
A powerful way to program GPUs with arrays is through Julia's higher-order array abstractions. The simple element-wise addition we saw above, `A_d .+= 1`, is an example of this, but more general constructs can be created with `broadcast`, `map`, `reduce`, `accumulate`, etc.:
if CUDA.functional()
    broadcast(-, A_d, 1)
end

if CUDA.functional()
    map(x -> x + 1, A_d)
end

if CUDA.functional()
    reduce(+, A_d)
end

if CUDA.functional()
    accumulate(+, A_d)
end
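These abstractions compose. For example, `mapreduce` applies the function during the reduction, avoiding an intermediate array:

if CUDA.functional()
    mapreduce(x -> x^2, +, A_d)   # sum of squares, computed entirely on the GPU
end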
Using the high-level GPU array functionality made it easy to perform this computation on the GPU. However, we didn't learn about what's going on under the hood, and that's the main goal of this tutorial. It's time to write our own kernels!
function vadd!(C, A, B)
    for i in 1:length(A)
        @inbounds C[i] = A[i] + B[i]
    end
    return nothing
end
A = ones(10)
B = ones(10)
C = similar(B)
vadd!(C, A, B)
C
if CUDA.functional()
    # We can already run this on the GPU with the @cuda macro,
    # which will compile vadd!() into a GPU kernel and launch it
    A_d = CuArray(A)
    B_d = CuArray(B)
    C_d = similar(B_d)
    @cuda vadd!(C_d, A_d, B_d)
    C_d
end
The macros for the other GPU backends are `@roc`, `@oneapi` and `@metal`.
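To see how slow this naive launch is, we can benchmark it (a sketch reusing the arrays from the previous cell; `CUDA.@sync` again ensures we time the whole kernel execution):

if CUDA.functional()
    @btime CUDA.@sync @cuda vadd!($C_d, $A_d, $B_d)
end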
The performance is just terrible, because each thread on the GPU would be performing the same loop! We have to remove the loop over all elements and instead use the special `threadIdx` and `blockDim` functions, analogous respectively to `threadid` and `nthreads` for multithreading.
We can split work between the GPU threads by using a special function which returns the index of the GPU thread which executes it.
GPU kernel: a function that will be executed by all GPU threads in parallel.
Based on the index of a thread we can make them operate on different pieces of the given data.
(It might be helpful to think of the GPU kernel as being the body of a loop.)
function vadd2!(C, A, B)
    index = threadIdx().x   # linear indexing, so only use `x`
    @inbounds C[index] = A[index] + B[index]
    return nothing
end
if CUDA.functional()
    N = 2^8
    A = 2 * CUDA.ones(N)
    B = 3 * CUDA.ones(N)
    C = similar(B)
    nthreads = N
    @cuda threads=nthreads vadd2!(C, A, B)
end
if CUDA.functional()
    all(Array(C) .== 5.0)
end
The syntax is similar for the other GPU backends!
# AMDGPU
groupsize = length(A)
@roc groupsize=groupsize vadd2!(C, A, B)

# oneAPI
items = length(A)
@oneapi items=items vadd2!(C, A, B)

# Metal
nthreads = length(A)
@metal threads=nthreads vadd2!(C, A, B)
To do even better, we need to parallelize more. GPUs have a limited number of threads they can run on a single streaming multiprocessor (SM), but they also have multiple SMs. To take advantage of them all, we need to run a kernel with multiple blocks. We'll divide up the work like this:
Conceptual mapping:
- Grid of blocks → entire GPU
- Blocks of threads → SMs
- Threads → CUDA cores
Note: up to three dimensions, $(x, y, z)$, can be used to organize the thread blocks and threads in each block.
This diagram was borrowed from a description of the NVIDIA C/C++ library; in Julia, threads and blocks begin numbering with 1 instead of 0. In the diagram, 4096 blocks of 256 threads each (making 1048576 = 2^20 threads) ensure that each thread increments just a single entry; however, to ensure that arrays of arbitrary size can be handled, let's still use a loop:
function vadd3!(C, A, B)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(B)
        @inbounds C[i] = A[i] + B[i]
    end
    return nothing
end
if CUDA.functional()
    nthreads = CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
end
The maximum number of allowed threads to launch depends on your GPU!
if CUDA.functional()
    N = 2^14
    A = 2 * CUDA.ones(N)
    B = 3 * CUDA.ones(N)
    C = similar(B)
    # smallest integer larger than or equal to N / nthreads
    numblocks = ceil(Int, N / nthreads)
end
if CUDA.functional()
    @cuda threads=nthreads blocks=numblocks vadd3!(C, A, B)
end
if CUDA.functional()
    all(Array(C) .== 5.0)
end
CUDA.jl supports indexing in up to 3 dimensions (x, y and z, e.g. `threadIdx().z`). This is convenient for multidimensional data, where thread blocks can be organised into 1D, 2D or 3D arrays of threads.
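As a sketch of 2D indexing (the kernel name `add2d!` and the sizes are illustrative), each thread computes its `(i, j)` coordinates from its thread and block indices and handles one matrix element:

function add2d!(C, A, B)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    j = threadIdx().y + (blockIdx().y - 1) * blockDim().y
    if i <= size(C, 1) && j <= size(C, 2)   # guard against out-of-bounds threads
        @inbounds C[i, j] = A[i, j] + B[i, j]
    end
    return nothing
end

if CUDA.functional()
    A2 = CUDA.ones(32, 32)
    B2 = CUDA.ones(32, 32)
    C2 = similar(A2)
    @cuda threads=(16, 16) blocks=(2, 2) add2d!(C2, A2, B2)
end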
To automatically select an appropriate number of threads, it is recommended to use the launch configuration (occupancy) API. This API takes a compiled (but not yet launched) kernel and returns an upper bound on the number of threads per block, together with the minimum number of blocks required to fully saturate the GPU. We can therefore create the kernel without launching it, query its launch configuration, and then launch the compiled kernel with that configuration:
if CUDA.functional()
    # compile the kernel without launching it
    kernel = @cuda launch=false vadd3!(C, A, B)
    # extract a suitable configuration via the occupancy API
    config = launch_configuration(kernel.fun)
    # the number of threads should not exceed the size of the array
    threads = min(length(A), config.threads)
    # smallest integer larger than or equal to length(A)/threads
    blocks = cld(length(A), threads)
    # launch the compiled kernel with this configuration
    kernel(C, A, B; threads, blocks)
end
Debugging: Many things can go wrong with GPU kernel programming and unfortunately error messages are sometimes not very useful because of how the GPU compiler works.
Conventional print-debugging is often a reasonably effective way to debug GPU code. CUDA.jl provides macros that facilitate this:
- `@cushow` (like `@show`): visualize an expression and its result, and return that value.
- `@cuprintln` (like `println`): print text and values.
- `@cuassert` (like `@assert`): can also be useful to find issues and abort execution.
GPU code introspection macros also exist, like `@device_code_warntype`, to track down type instabilities.
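For example (a sketch, reusing the `vadd2!` kernel and the arrays `C`, `A`, `B` defined earlier):

if CUDA.functional()
    @device_code_warntype @cuda vadd2!(C, A, B)
end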
function gpu_add_print!(y, x)
    index = threadIdx().x   # this example only requires linear indexing, so just use `x`
    stride = blockDim().x
    @cuprintln("thread $index, stride $stride")
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end
if CUDA.functional()
    x_d = CUDA.rand(10)
    y_d = CUDA.rand(10)
    @cuda threads=10 gpu_add_print!(y_d, x_d)
    synchronize()
end
Conclusion: Keep in mind that the high-level functionality of CUDA often means that you don't need to worry about writing kernels at such a low level. However, there are many cases where computations can be optimized using clever low-level manipulations. The kernels implemented in Julia give you all the flexibility and performance a GPU has to offer, within a familiar language.
A typical approach for porting or developing an application for the GPU is as follows:
- develop an application using generic array functionality, and test it on the CPU with the `Array` type;
- port your application to the GPU by switching to the `CuArray` type;
- disallow the CPU fallback ("scalar indexing") to find operations that are not implemented for or incompatible with GPU execution;
- (optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance.
Exercise: GPU-port the `sqrt_sum` function we saw in the first notebook:
function sqrt_sum(A)
    T = eltype(A)
    s = zero(T)
    for i in eachindex(A)
        @inbounds s += sqrt(A[i])
    end
    return s
end
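If you get stuck, one possible approach (a sketch, not the only solution) is to replace the sequential loop with `mapreduce`, which GPU array types execute in parallel on the device:

if CUDA.functional()
    A_d = CUDA.rand(100_000)
    sqrt_sum_gpu(A) = mapreduce(sqrt, +, A)
    sqrt_sum_gpu(A_d) ≈ sqrt_sum(Array(A_d))   # compare against the CPU version
end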
References:
- https://enccs.github.io/Julia-for-HPC/GPU/
- https://cuda.juliagpu.org/stable/
- https://www.youtube.com/watch?v=Fz-ogmASMAE
- https://www.cherryservers.com/blog/gpu-vs-cpu-what-are-the-key-differences
- https://developer.nvidia.com/blog/tag/cuda-refresher/
- https://i.redd.it/yr9h5cpyzpn21.jpg
- https://docs.nvidia.com/cuda/
- https://www.youtube.com/watch?v=Hz9IMJuW5hU
- https://julialang.org/learning/