using BenchmarkTools
using CUDA
CUDA.jl is the most mature of the Julia GPU backends and is the one we will use in this workshop. AMDGPU.jl is somewhat behind but still ready for general use, while oneAPI.jl and Metal.jl are functional but may contain bugs, lack some features, and provide suboptimal performance.
What is the difference between a CPU and a GPU?
![CPU vs GPU architecture](./Graphics/cpu_vs_gpu.png)
![GPU meme](./Graphics/meme_gpu.jpg)
Some key aspects of GPUs that need to be kept in mind:
- The large number of compute elements on a GPU (in the thousands) can enable extreme scaling for data parallel tasks.
- GPUs have their own memory. This means that data needs to be transferred to and from the GPU during the execution of a program.
- Cores in a GPU are arranged into a particular structure. At the highest level they are divided into "streaming multiprocessors" (SMs). Some of these details are important when writing your own GPU kernels.
![Schematic layout of a GPU with its streaming multiprocessors](./Graphics/gpu.png)
- host: CPU + system memory (host memory)
- device: GPU with its memory (device memory)
- SM: Streaming Multiprocessor
Communication:
- Host-device bandwidth: 31.5 GB/s
- GPU global memory bandwidth: 1555 GB/s
GPU programming with Julia can be as simple as using a different array type instead of regular `Base.Array` arrays:
- `CuArray` from CUDA.jl for NVIDIA GPUs
- `ROCArray` from AMDGPU.jl for AMD GPUs
- `oneArray` from oneAPI.jl for Intel GPUs
- `MtlArray` from Metal.jl for Apple GPUs
These array types are subtypes of `AbstractGPUArray` from GPUArrays.jl and closely resemble `Base.Array`, which enables us to write generic code that works on both CPU and GPU arrays.
if CUDA.functional()
    A_d = CuArray([1, 2, 3, 4])
    A_d .+= 1
end
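Because the code is generic, the same function can run on either device. Here is a minimal sketch (the function name `scale_shift` is purely illustrative):

scale_shift(A, a, b) = a .* A .+ b

scale_shift(ones(4), 2, 1)           # runs on the CPU with Base.Array
if CUDA.functional()
    scale_shift(CUDA.ones(4), 2, 1)  # the same code runs on the GPU with CuArray
end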
We can do the same operation with other subtypes of `AbstractGPUArray`:
# requires: using AMDGPU
if AMDGPU.functional()
    A_d = ROCArray([1, 2, 3, 4])
    A_d .+= 1
end

# requires: using oneAPI
if oneAPI.functional()
    A_d = oneArray([1, 2, 3, 4])
    A_d .+= 1
end

# requires: using Metal
if Metal.functional()
    A_d = MtlArray([1, 2, 3, 4])
    A_d .+= 1
end
Moving an array back from the GPU to the CPU is simple:
if CUDA.functional()
    A = Array(A_d)
end
However, the overhead of copying data to the GPU makes such simple calculations very slow.
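To see this concretely, here is a minimal sketch (the array size is arbitrary) that times the transfer and the computation separately; note that the first execution of each operation also includes compilation time:

if CUDA.functional()
    X = rand(Float32, 2^20)
    CUDA.@time X_d = CuArray(X)   # dominated by the host-to-device copy
    CUDA.@time X_d .+= 1          # the computation itself is tiny in comparison
end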
Let’s have a look at a more realistic example: matrix multiplication. We create two random arrays, one on the CPU and one on the GPU, and compare the performance:
if CUDA.functional()
    A = rand(2^12, 2^12)
    A_d = CuArray(A)
    @btime $A * $A
    CUDA.@time A_d * A_d
end

if CUDA.functional()
    A = rand(Float32, 2^12, 2^12)
    A_d = CuArray(A)
    @btime $A * $A
    CUDA.@time A_d * A_d
end
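Note that GPU operations execute asynchronously: to benchmark them fairly with BenchmarkTools, wrap the expression in `CUDA.@sync` so the timing waits for the computation to finish. A small sketch:

if CUDA.functional()
    @btime CUDA.@sync $A_d * $A_d
end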
GPUs normally perform significantly better with 32-bit floats, and some GPUs don't support 64-bit floats at all!
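If your data starts out in double precision, you can convert it during the upload. A small sketch (the matrix size is arbitrary):

if CUDA.functional()
    A = rand(2^10, 2^10)          # Float64 on the CPU
    A_d = CuArray{Float32}(A)     # converted to Float32 during the copy
    eltype(A_d)                   # Float32
end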
Many array operations in Julia are implemented using loops, processing one element at a time. Doing so with GPU arrays is very inefficient, as the loop won't actually execute on the GPU but will instead transfer one element at a time and process it on the CPU. As this wrecks performance, you will be warned when performing this kind of iteration:
if CUDA.functional()
    A_d[1] = 3.0
end
Scalar indexing is only allowed in an interactive session, e.g. the REPL, because it is convenient when porting CPU code to the GPU. If you want to disallow scalar indexing, e.g. to verify that your application executes correctly on the GPU, call the `allowscalar` function:
if CUDA.functional()
    CUDA.allowscalar(false)
    A_d[1] = 3.0   # this will now throw an error
end
In a non-interactive session, e.g. when running code from a script or application, scalar indexing is disallowed by default. There is no global toggle to allow scalar indexing; if you really need it, you can mark expressions using `allowscalar` with do-block syntax or the `@allowscalar` macro:
if CUDA.functional()
    CUDA.allowscalar(false)
    CUDA.allowscalar() do
        A_d[1] += 1
    end
    CUDA.@allowscalar A_d[1] += 1
end
NVIDIA provides the CUDA toolkit, a collection of libraries that contain precompiled kernels for common operations like matrix multiplication (cuBLAS), fast Fourier transforms (cuFFT), linear solvers (cuSOLVER), sparse linear algebra (cuSPARSE), etc. These kernels are wrapped in CUDA.jl and can be used directly with CuArrays.
The recommended way to use CUDA.jl is to let it automatically download an appropriate CUDA toolkit. CUDA.jl will check your driver's capabilities, which versions of CUDA are available for your platform, and automatically download an appropriate artifact containing all the libraries that CUDA.jl supports.
You can also pin CUDA.jl to a specific toolkit version:

CUDA.set_runtime_version!(v"11.8")
To use a local CUDA installation instead, invoke the same API with `local_toolkit=true`:

CUDA.set_runtime_version!(local_toolkit=true)
if CUDA.functional()
    CUDA.versioninfo()
end
Let's do a guided tour of what is inside CUDA.jl!
if CUDA.functional()
    using CUDA.CUBLAS
    using CUDA.CUFFT
    using CUDA.CUSOLVER
    using CUDA.CUSPARSE
end
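Much of this functionality is available transparently through standard Julia functions. As a small sketch (the names `M_d` and `b_d` and the sizes are arbitrary), the following operations dispatch to cuBLAS and cuSOLVER respectively:

if CUDA.functional()
    M_d = CUDA.rand(512, 512)
    b_d = CUDA.rand(512)
    M_d * M_d    # matrix multiplication runs on cuBLAS
    M_d \ b_d    # dense linear solve runs on cuSOLVER
end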
A powerful way to program GPUs with arrays is through Julia's higher-order array abstractions. The simple element-wise addition we saw above, `A_d .+= 1`, is an example of this, but more general constructs can be created with `broadcast`, `map`, `reduce`, `accumulate`, etc.:
if CUDA.functional()
    broadcast(-, A_d, 1)
end

if CUDA.functional()
    map(x -> x + 1, A_d)
end

if CUDA.functional()
    reduce(+, A_d)
end

if CUDA.functional()
    accumulate(+, A_d)
end
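These abstractions compose. For example, `mapreduce` applies the function during the reduction, avoiding an intermediate array:

if CUDA.functional()
    mapreduce(x -> x^2, +, A_d)   # sum of squares, computed entirely on the GPU
end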
Using the high-level GPU array functionality made it easy to perform this computation on the GPU. However, we didn't learn about what's going on under the hood, and that's the main goal of this tutorial. It's time to write our own kernels!
function vadd!(C, A, B)
    for i in 1:length(A)
        @inbounds C[i] = A[i] + B[i]
    end
    return nothing
end
A = ones(10)
B = ones(10)
C = similar(B)
vadd!(C, A, B)
C
if CUDA.functional()
    # We can already run this on the GPU with the @cuda macro,
    # which will compile vadd!() into a GPU kernel and launch it
    A_d = CuArray(A)
    B_d = CuArray(B)
    C_d = similar(B_d)
    @cuda vadd!(C_d, A_d, B_d)
    C_d
end
The macros for the other GPU backends are `@roc`, `@oneapi` and `@metal`.
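To see how slow this naive launch is, we can benchmark it (a sketch reusing the arrays from the previous cell; `CUDA.@sync` again ensures we time the whole kernel execution):

if CUDA.functional()
    @btime CUDA.@sync @cuda vadd!($C_d, $A_d, $B_d)
end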
The performance is just terrible, because each thread on the GPU would be performing the same loop! We have to remove the loop over all elements and instead use the special `threadIdx` and `blockDim` functions, analogous respectively to `threadid` and `nthreads` for multithreading.
We can split work between the GPU threads by using a special function which returns the index of the GPU thread which executes it.
GPU kernel: a function that will be executed by all GPU threads in parallel.
Based on the index of a thread we can make them operate on different pieces of the given data.
(It might be helpful to think of the GPU kernel as being the body of a loop.)
function vadd2!(C, A, B)
    index = threadIdx().x   # linear indexing, so only use `x`
    @inbounds C[index] = A[index] + B[index]
    return nothing
end
if CUDA.functional()
    N = 2^8
    A = 2 * CUDA.ones(N)
    B = 3 * CUDA.ones(N)
    C = similar(B)
    nthreads = N
    @cuda threads=nthreads vadd2!(C, A, B)
end
if CUDA.functional()
    all(Array(C) .== 5.0)
end
The syntax is similar for the other GPU backends!
# AMDGPU
groupsize = length(A)
@roc groupsize=groupsize vadd2!(C, A, B)

# oneAPI
items = length(A)
@oneapi items=items vadd2!(C, A, B)

# Metal
nthreads = length(A)
@metal threads=nthreads vadd2!(C, A, B)
To do even better, we need to parallelize more. GPUs have a limited number of threads they can run on a single streaming multiprocessor (SM), but they also have multiple SMs. To take advantage of them all, we need to run a kernel with multiple blocks. We'll divide up the work like this:
Conceptual mapping:
- Grid of blocks → entire GPU
- Blocks of threads → SMs
- Threads → CUDA cores
Note: up to three dimensions, $(x, y, z)$, can be used to organize the thread blocks and threads in each block.
This diagram was borrowed from a description of the NVIDIA C/C++ library; in Julia, threads and blocks begin numbering with 1 instead of 0. In the diagram, 4096 blocks of 256 threads each (making 1048576 = 2^20 threads) ensure that each thread increments just a single entry; however, to ensure that arrays of arbitrary size can be handled, let's still use a loop:
function vadd3!(C, A, B)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(B)
        @inbounds C[i] = A[i] + B[i]
    end
    return nothing
end
if CUDA.functional()
    nthreads = CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
end
The maximum number of allowed threads to launch depends on your GPU!
if CUDA.functional()
    N = 2^14
    A = 2 * CUDA.ones(N)
    B = 3 * CUDA.ones(N)
    C = similar(B)
    # smallest integer larger than or equal to N / nthreads
    numblocks = ceil(Int, N / nthreads)
end
if CUDA.functional()
    @cuda threads=nthreads blocks=numblocks vadd3!(C, A, B)
end
if CUDA.functional()
    all(Array(C) .== 5.0)
end
CUDA.jl supports indexing in up to 3 dimensions (x, y and z, e.g. `threadIdx().z`). This is convenient for multidimensional data, where thread blocks can be organised into 1D, 2D or 3D arrays of threads.
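As a sketch of 2D indexing (the kernel name `add2d!` and the sizes are illustrative), each thread computes its `(i, j)` coordinates from its thread and block indices and handles one matrix element:

function add2d!(C, A, B)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    j = threadIdx().y + (blockIdx().y - 1) * blockDim().y
    if i <= size(C, 1) && j <= size(C, 2)   # guard against out-of-bounds threads
        @inbounds C[i, j] = A[i, j] + B[i, j]
    end
    return nothing
end

if CUDA.functional()
    A2 = CUDA.ones(32, 32)
    B2 = CUDA.ones(32, 32)
    C2 = similar(A2)
    @cuda threads=(16, 16) blocks=(2, 2) add2d!(C2, A2, B2)
end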
To automatically select an appropriate number of threads, it is recommended to use the launch configuration (occupancy) API. This API takes a compiled (but not yet launched) kernel and returns an upper bound on the number of threads per block, together with the minimum number of blocks required to fully saturate the GPU. We can therefore create the kernel without launching it, query its launch configuration, and then launch the compiled kernel with that configuration:
if CUDA.functional()
    # compile the kernel without launching it
    kernel = @cuda launch=false vadd3!(C, A, B)
    # extract a suitable configuration via the occupancy API
    config = launch_configuration(kernel.fun)
    # the number of threads should not exceed the size of the array
    threads = min(length(A), config.threads)
    # smallest integer larger than or equal to length(A)/threads
    blocks = cld(length(A), threads)
    # launch the compiled kernel with this configuration
    kernel(C, A, B; threads, blocks)
end
Debugging: Many things can go wrong with GPU kernel programming and unfortunately error messages are sometimes not very useful because of how the GPU compiler works.
Conventional print-debugging is often a reasonably effective way to debug GPU code. CUDA.jl provides macros that facilitate this:
- `@cushow` (like `@show`): visualize an expression and its result, and return that value.
- `@cuprintln` (like `println`): print text and values.
- `@cuassert` (like `@assert`): can also be useful to find issues and abort execution.
GPU code introspection macros also exist, like `@device_code_warntype`, to track down type instabilities.
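For example (a sketch, reusing the `vadd2!` kernel and the arrays `C`, `A`, `B` defined earlier):

if CUDA.functional()
    @device_code_warntype @cuda vadd2!(C, A, B)
end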
function gpu_add_print!(y, x)
    index = threadIdx().x   # this example only requires linear indexing, so just use `x`
    stride = blockDim().x
    @cuprintln("thread $index, stride $stride")
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end
if CUDA.functional()
    x_d = CUDA.rand(10)
    y_d = CUDA.rand(10)
    @cuda threads=10 gpu_add_print!(y_d, x_d)
    synchronize()
end
Conclusion: Keep in mind that the high-level functionality of CUDA often means that you don't need to worry about writing kernels at such a low level. However, there are many cases where computations can be optimized using clever low-level manipulations. The kernels implemented in Julia give you all the flexibility and performance a GPU has to offer, within a familiar language.
A typical approach for porting or developing an application for the GPU is as follows:
- develop an application using generic array functionality, and test it on the CPU with the `Array` type;
- port your application to the GPU by switching to the `CuArray` type;
- disallow the CPU fallback ("scalar indexing") to find operations that are not implemented for or incompatible with GPU execution;
- (optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance.
Exercise: GPU-port the `sqrt_sum` function we saw in the first notebook:
function sqrt_sum(A)
    T = eltype(A)
    s = zero(T)
    for i in eachindex(A)
        @inbounds s += sqrt(A[i])
    end
    return s
end
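If you get stuck, one possible approach (a sketch, not the only solution) is to replace the sequential loop with `mapreduce`, which GPU array types execute in parallel on the device:

if CUDA.functional()
    A_d = CUDA.rand(100_000)
    sqrt_sum_gpu(A) = mapreduce(sqrt, +, A)
    sqrt_sum_gpu(A_d) ≈ sqrt_sum(Array(A_d))   # compare against the CPU version
end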
References:
- https://enccs.github.io/Julia-for-HPC/GPU/
- https://cuda.juliagpu.org/stable/
- https://www.youtube.com/watch?v=Fz-ogmASMAE
- https://www.cherryservers.com/blog/gpu-vs-cpu-what-are-the-key-differences
- https://developer.nvidia.com/blog/tag/cuda-refresher/
- https://i.redd.it/yr9h5cpyzpn21.jpg
- https://docs.nvidia.com/cuda/
- https://www.youtube.com/watch?v=Hz9IMJuW5hU
- https://julialang.org/learning/