GPU Functions in Mojo: A CUDA-style Programming Interface
In this recipe, we'll walk through four GPU programming examples:

- Vector addition
- Grayscale conversion of an RGB image
- Naive matrix multiplication
- Calculating the Mandelbrot set
Let's get started.
Please make sure your system meets our system requirements.
To proceed, ensure you have the magic CLI installed, with magic --version reporting 0.7.2 or newer:
curl -ssL https://magic.modular.com/ | bash
or update it via:
magic self-update
These examples require a MAX-compatible GPU satisfying these requirements.
Download this recipe using the magic CLI:
magic init gpu-functions-mojo --from gpu-functions-mojo
cd gpu-functions-mojo
Run the examples:
magic run vector_addition
magic run grayscale
magic run naive_matrix_multiplication
magic run mandelbrot
MAX is a flexible, hardware-independent framework for programming GPUs and CPUs. It lets you get the most out of accelerators without requiring architecture-specific frameworks like CUDA. MAX has many levels, from highly optimized AI models, to the computational graphs that define those models, to direct GPU programming in the Mojo language. You can use whatever level of abstraction best suits your needs in MAX.
Mojo is a Python-family language built for high-performance computing. It allows you to write custom algorithms for GPUs without the use of CUDA or other vendor-specific libraries. All of the operations that power AI models within MAX are written in Mojo.
We'll demonstrate an entry point into GPU programming with MAX that lets you define, compile, and dispatch individual thread-based functions onto a GPU. This is powered by the MAX Driver API, which handles all the hardware-specific details of allocating and transferring memory between host and accelerator, as well as the compilation and execution of accelerator-targeted functions.
The first three examples in this recipe show common starting points for thread-based GPU programming. They follow the first three examples in the popular GPU programming textbook Programming Massively Parallel Processors:

- Vector addition
- RGB-to-grayscale image conversion
- Naive matrix multiplication
The final example demonstrates calculating the Mandelbrot set on the GPU.
These examples also work hand-in-hand with our guide to the basics of GPU programming in Mojo, which we recommend reading alongside this recipe.
The common "hello world" example used for data-parallel programming is the addition of each element in two vectors. Let's take a look at how you can implement that in MAX.
Define the vector addition function.
The function itself is very simple: it runs once per thread, adds the elements of the two input vectors that correspond to that thread's ID, and stores the result at the matching location in the output vector. If the dispatched block of threads is wider than the vector length, threads past the end of the vector perform no work.
fn vector_addition(
    lhs: LayoutTensor[float_dtype],
    rhs: LayoutTensor[float_dtype],
    out: LayoutTensor[float_dtype],
):
    # Calculate this thread's global index within the grid.
    tid = block_dim.x * block_idx.x + thread_idx.x
    # Guard against threads that fall past the end of the vector.
    if tid < out.layout.size():
        out[tid] = lhs[tid] + rhs[tid]
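Conceptually, the kernel parallelizes the following serial loop, with one GPU thread per iteration. This is a sketch for comparison only, reusing the same hypothetical signature as the kernel above; it isn't part of the recipe's code:

# Illustrative sketch: serial CPU equivalent of the vector_addition kernel.
# Each GPU thread performs exactly one iteration of this loop.
fn vector_addition_serial(
    lhs: LayoutTensor[float_dtype],
    rhs: LayoutTensor[float_dtype],
    out: LayoutTensor[float_dtype],
):
    for i in range(out.layout.size()):
        out[i] = lhs[i] + rhs[i]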
Obtain a reference to the host (CPU) and accelerator (GPU) devices.
gpu_device = accelerator()
host_device = cpu()
If no MAX-compatible accelerator is found, this exits with an error.
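If you'd rather detect that condition yourself, the Mojo standard library provides a compile-time check. Here's a minimal sketch, assuming sys.has_accelerator is available in your Mojo version:

from sys import has_accelerator

def main():
    @parameter
    if not has_accelerator():
        # Hypothetical fallback path when no MAX-compatible GPU is present.
        print("No MAX-compatible GPU found")
        return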
Allocate input and output vectors.
The left-hand-side and right-hand-side vectors need to be allocated on the CPU, initialized with values, and then moved to the GPU for use by our function. This is done using the MAX Driver API's Tensor type.
alias float_dtype = DType.float32
alias tensor_rank = 1
alias VECTOR_WIDTH = 10

# Allocate the input tensors on the host and fill them with values.
lhs_tensor = Tensor[float_dtype, tensor_rank]((VECTOR_WIDTH), host_device)
rhs_tensor = Tensor[float_dtype, tensor_rank]((VECTOR_WIDTH), host_device)
for i in range(VECTOR_WIDTH):
    lhs_tensor[i] = 1.25
    rhs_tensor[i] = 2.5

# Move the initialized tensors onto the GPU.
lhs_tensor = lhs_tensor.move_to(gpu_device)
rhs_tensor = rhs_tensor.move_to(gpu_device)
A tensor to hold the result of the calculation is allocated on the GPU:
out_tensor = Tensor[float_dtype, tensor_rank](
    (VECTOR_WIDTH), gpu_device
)
Compile and dispatch the function.
The actual vector_addition() function we want to run on the GPU is compiled and dispatched across a grid, divided into blocks of threads. All arguments to this GPU function are provided here, in an order that corresponds to their position in the function signature. Note that in MAX, the GPU function is compiled for the GPU when the Mojo file containing it is compiled.
# Compile the GPU function for the target accelerator.
gpu_function = Accelerator.compile[vector_addition](gpu_device)

alias BLOCK_SIZE = 16
var num_blocks = ceildiv(VECTOR_WIDTH, BLOCK_SIZE)

# Dispatch across a grid of num_blocks blocks, each BLOCK_SIZE threads wide.
gpu_function(
    gpu_device,
    lhs_tensor.to_layout_tensor(),
    rhs_tensor.to_layout_tensor(),
    out_tensor.to_layout_tensor(),
    grid_dim=Dim(num_blocks),
    block_dim=Dim(BLOCK_SIZE),
)
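For example, with VECTOR_WIDTH = 10 and BLOCK_SIZE = 16, ceildiv(10, 16) = 1, so the grid consists of a single block of 16 threads; the six threads whose IDs fall past the end of the vector do no work, thanks to the bounds check in the kernel.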
Return the results.
Finally, the results of the calculation are moved from the GPU back to the host to be examined:
out_tensor = out_tensor.move_to(host_device)
print("Resulting vector:", out_tensor)
To try out this example yourself, run it using the following command:
magic run vector_addition
For this initial example, the output you see should be a vector where all the elements are 3.75. Experiment with changing the vector length, the block size, and other parameters to see how the calculation scales.
As a slightly more complex example, the next step in this recipe shows how to convert a red-green-blue (RGB) color image into grayscale. This uses a rank-3 tensor to hold the 2-D image and the color channels at each pixel. The input has three color channels, and the output has only a single grayscale channel.
The calculation performed is a common reduction to luminance using weighted values for the three channels:
gray = 0.21 * red + 0.71 * green + 0.07 * blue
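For example, a pure green pixel with channel values (0, 255, 0) reduces to 0.71 × 255 ≈ 181 in the grayscale output, while a pure blue pixel (0, 0, 255) reduces to only 0.07 × 255 ≈ 18, reflecting how much more strongly the eye responds to green than to blue.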
And here is the per-thread function to perform this on the GPU:
fn color_to_grayscale_conversion(
    width: Int,
    height: Int,
    image: LayoutTensor[channel_dtype],
    out: LayoutTensor[channel_dtype],
):
    # Map this thread onto a (row, col) pixel coordinate.
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x

    if col < width and row < height:
        red = image[row, col, 0].cast[internal_float_dtype]()
        green = image[row, col, 1].cast[internal_float_dtype]()
        blue = image[row, col, 2].cast[internal_float_dtype]()
        gray = 0.21 * red + 0.71 * green + 0.07 * blue
        out[row, col, 0] = gray.cast[channel_dtype]()
The setup, compilation, and execution of this function are much the same as in the previous example, but in this case we're using rank-3 instead of rank-1 Tensors to hold the values. Also, we dispatch the function over a 2-D grid of blocks, which looks like the following:
alias BLOCK_SIZE = 16
num_col_blocks = ceildiv(IMAGE_WIDTH, BLOCK_SIZE)
num_row_blocks = ceildiv(IMAGE_HEIGHT, BLOCK_SIZE)

gpu_function(
    gpu_device,
    IMAGE_WIDTH,
    IMAGE_HEIGHT,
    rgb_tensor.to_layout_tensor(),
    gray_tensor.to_layout_tensor(),
    grid_dim=Dim(num_col_blocks, num_row_blocks),
    block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE),
)
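As a concrete illustration (using hypothetical dimensions, not necessarily those in the recipe): a 64×48 image with BLOCK_SIZE = 16 yields num_col_blocks = 4 and num_row_blocks = 3, a 4×3 grid of 12 blocks with 16 × 16 = 256 threads each, or 3,072 threads in total.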
To run this example, use this command:
magic run grayscale
This will show a grid of numbers representing the grayscale values for a single color broadcast across a simple input image. Try changing the image and block sizes to see how this scales on the GPU.
The next example performs a very basic matrix multiplication, with no optimizations to take advantage of hardware resources. The GPU function for this looks like the following:
fn naive_matrix_multiplication[
    m_layout: Layout,
    n_layout: Layout,
    p_layout: Layout,
](
    m: LayoutTensor[float_dtype, m_layout, MutableAnyOrigin],
    n: LayoutTensor[float_dtype, n_layout, MutableAnyOrigin],
    p: LayoutTensor[float_dtype, p_layout, MutableAnyOrigin],
):
    # Each thread computes one element of the output matrix p.
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x

    m_dim = p.dim(0)
    n_dim = p.dim(1)
    k_dim = n.dim(0)

    if row < m_dim and col < n_dim:
        for j_index in range(k_dim):
            p[row, col] += m[row, j_index] * n[j_index, col]
The overall setup and execution of this function are extremely similar to the previous example, with the primary change being the function that is run on the GPU.
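Each thread computes a single element of p: the dot product of one row of m with one column of n. A full dispatch therefore performs m_dim × n_dim × k_dim multiply-accumulates, with every thread re-reading its row and column from global memory; eliminating that redundant traffic (for example, with shared-memory tiling) is the usual first optimization, which this naive version deliberately omits.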
To try out this example, run this command:
magic run naive_matrix_multiplication
You will see the two input matrices printed to the console, as well as the result of their multiplication. As with the previous examples, try changing the sizes of the matrices and how they are dispatched on the GPU.
The final example in this recipe shows a slightly more complex calculation (pun intended): the Mandelbrot set fractal. This custom operation takes no input tensors, only a set of scalar arguments, and returns a 2-D matrix of integer values representing the number of iterations it took to escape at that location in complex number space.
The per-thread GPU function for this is as follows:
fn mandelbrot(
    min_x: Scalar[float_dtype],
    min_y: Scalar[float_dtype],
    scale_x: Scalar[float_dtype],
    scale_y: Scalar[float_dtype],
    max_iterations: Scalar[int_dtype],
    out: TensorType,
):
    var row = block_dim.y * block_idx.y + thread_idx.y
    var col = block_dim.x * block_idx.x + thread_idx.x

    # Obtain the complex number C for this thread's grid location.
    var cx = min_x + col * scale_x
    var cy = min_y + row * scale_y
    var c = ComplexSIMD[float_dtype, 1](cx, cy)
    var z = ComplexSIMD[float_dtype, 1](0, 0)
    var iters = Scalar[int_dtype](0)

    # Iterate Z = Z^2 + C, counting iterations until escape.
    var in_set_mask: Scalar[DType.bool] = True
    for _ in range(max_iterations):
        if not any(in_set_mask):
            break
        in_set_mask = z.squared_norm() <= 4
        iters = in_set_mask.select(iters + 1, iters)
        z = z.squared_add(c)

    out[row, col] = iters
This begins by calculating the complex number C that represents a given location in the output grid. Then, starting from Z = 0, the calculation Z = Z^2 + C is iterated until the squared magnitude of Z exceeds 4, the threshold we're using for when Z will escape the set. This continues up to a maximum number of iterations, and the number of iterations to escape (or the maximum, if it never escapes) is then recorded for each location in the grid.
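For example, at C = 1 + 0i the iterates are 1, 2, and 5; since 5^2 = 25 > 4, that cell records an escape after 3 iterations. At C = 0 the iterate stays at 0 forever, so that cell records MAX_ITERATIONS.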
The area to examine in complex space, the resolution of the grid, and the maximum number of iterations are all provided as constants:
alias MIN_X: Scalar[float_dtype] = -2.0
alias MAX_X: Scalar[float_dtype] = 0.7
alias MIN_Y: Scalar[float_dtype] = -1.12
alias MAX_Y: Scalar[float_dtype] = 1.12
alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH
alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT
alias MAX_ITERATIONS = 100
You can calculate the Mandelbrot set on the GPU using this command:
magic run mandelbrot
The result should be an ASCII art depiction of the region covered by the calculation:
...................................,,,,c@8cc,,,.............
...............................,,,,,,cc8M @Mjc,,,,..........
............................,,,,,,,ccccM@aQaM8c,,,,,........
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
.......................,,,,,,,,c8888M@j, ,wMM8cccc,,.....
.....................,,,,,,cccMQOPjjPrgg, OrwrwMMMjjc,....
..................,,,,cccccc88MaP @ ,pGa.g8c,...
...............,,cccccccc888MjQp. o@8cc,..
..........,,,,c8jjMMMMMMMMM@@w. aj8c,,.
.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,.
..,,,,,,,cccccMMjwQ EpQ .8c,,,
.,,,,,,cc888MrajwJ MMcc,,,
.cc88jMMM@@jaG. oM8cc,,,
.cc88jMMM@@jaG. oM8cc,,,
.,,,,,,cc888MrajwJ MMcc,,,
..,,,,,,,cccccMMjwQ EpQ .8c,,,
.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,.
..........,,,,c8jjMMMMMMMMM@@w. aj8c,,.
...............,,cccccccc888MjQp. o@8cc,..
..................,,,,cccccc88MaP @ ,pGa.g8c,...
.....................,,,,,,cccMQOEjjPrgg, OrwrwMMMjjc,....
.......................,,,,,,,,c8888M@j, ,wMM8cccc,,.....
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
............................,,,,,,,ccccM@aQaM8c,,,,,........
...............................,,,,,,cc8M @Mjc,,,,..........
Try changing the various parameters above to produce different resolution grids, or look into different areas in the complex number space.
In this recipe, we've demonstrated how to perform the basics of thread-parallel GPU programming in MAX using Mojo and the MAX Driver API. This programming model will be familiar to anyone who has used CUDA C, and we've used a series of common examples to show how concepts map from CUDA to MAX.
MAX has far more power available for fully utilizing GPUs through building computational graphs and then defining custom operations in Mojo using hardware-optimized abstractions. MAX is an extremely flexible framework for programming GPUs, with interfaces at many levels. In this recipe, you've been introduced to the very basics of getting started with GPU programming in MAX and Mojo, but there's much more to explore!
Try applying GPU programming in MAX for more complex workloads via tutorials on the MAX Graph API and defining custom GPU graph operations in Mojo.
Explore MAX's documentation for additional features. The gpu module has details on Mojo's GPU programming functions and types.
Join our Modular Forum and Discord community to share your experiences and get support.
We're excited to see what you'll build with MAX! Share your projects and experiences with us using #ModularAI on social media.