Mojo🔥 GPU Puzzles

Overview

Implement a kernel that broadcast-adds vector a and vector b and stores the result in the 2D matrix output.

Note: You have more threads (a 3×3 block, 9 threads) than output positions (a 2×2 matrix, 4 elements).

Key concepts

In this puzzle, you’ll learn about:

- Broadcasting 1D vectors across different dimensions
- Using 2D thread indices for broadcast operations
- Handling boundary conditions in broadcast patterns

The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly.

- Broadcasting: Each element of a combines with each element of b
- Thread mapping: 2D thread grid \((3 \times 3)\) for \(2 \times 2\) output
- Vector access: Different access patterns for a and b
- Bounds checking: Guard against threads outside matrix dimensions
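
The thread mapping and the guard can be previewed on the CPU before writing the kernel. The sketch below is plain Python, not Mojo GPU code: it visits the thread indices one after another, whereas real GPU threads run in parallel. The helper name `active_threads` is hypothetical, and reading `THREADS_PER_BLOCK` as (x, y) counts is an assumption. It only reports which threads fall inside the output matrix.

```python
# CPU-side illustration only: GPU threads run in parallel, this loop is sequential.
SIZE = 2                     # output is SIZE x SIZE, as in the puzzle
THREADS_PER_BLOCK = (3, 3)   # assumed to be (x, y) thread counts


def active_threads(size, threads_per_block):
    """Return the (row, col) pairs that pass the guard `row < size and col < size`."""
    threads_x, threads_y = threads_per_block
    active = []
    for row in range(threads_y):      # plays the role of thread_idx.y
        for col in range(threads_x):  # plays the role of thread_idx.x
            if row < size and col < size:
                active.append((row, col))
    return active


active = active_threads(SIZE, THREADS_PER_BLOCK)
total = THREADS_PER_BLOCK[0] * THREADS_PER_BLOCK[1]
print(f"{len(active)} of {total} threads do work: {active}")
print(f"{total - len(active)} threads fall outside the matrix and must be skipped")
```

Only the four threads with both indices below 2 have an output element to write; the remaining five are the ones the guard has to exclude.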

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn broadcast_add(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)

View full file: problems/p05/p05.mojo

Tips

- Get 2D indices: row = thread_idx.y, col = thread_idx.x
- Add guard: if row < size and col < size
- Inside guard: think about how to broadcast values of a and b

Running the code

To test your solution, run one of the following commands in your terminal, depending on whether you use uv or pixi:

uv run poe p05

pixi run p05

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])

Solution

fn broadcast_add(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
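    row = thread_idx.y
    col = thread_idx.x
    # Skip the extra threads of the 3x3 block that fall outside the 2x2 output.
    if row < size and col < size:
        # Row-major flat index; a is indexed by column, b by row.
        output[row * size + col] = a[col] + b[row]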

This solution demonstrates fundamental GPU broadcasting concepts without the LayoutTensor abstraction:

Thread to matrix mapping
- Uses thread_idx.y for row access and thread_idx.x for column access
- Direct mapping from 2D thread grid to output matrix elements
- Handles excess threads (3×3 grid) for 2×2 output matrix
Broadcasting mechanics
- Vector a varies along each row: a[col] depends only on the column, so the same element of a is reused in every row
- Vector b varies down each column: b[row] depends only on the row, so the same element of b is reused in every column
- Output combines both vectors through addition
```
[ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
              [ b1 ]     [ a0+b1  a1+b1 ]
```
Bounds checking
- Single guard condition row < size and col < size handles both dimensions
- Prevents out-of-bounds access for both input vectors and output matrix
- Required due to 3×3 thread grid being larger than 2×2 data
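
As a hedged cross-check of the three points above, the following plain-Python sketch replays the solution on the CPU, one simulated thread at a time. It is not Mojo GPU code. The helper `broadcast_add_cpu` is hypothetical, and the inputs a = b = [0.0, 1.0] are an assumption, chosen because they reproduce the expected output shown earlier.

```python
def broadcast_add_cpu(a, b, size, threads_per_block=(3, 3)):
    """Sequentially emulate every thread of one block running the solution kernel."""
    output = [0.0] * (size * size)    # flat, row-major 2D matrix
    skipped = 0
    threads_x, threads_y = threads_per_block
    for row in range(threads_y):      # plays the role of thread_idx.y
        for col in range(threads_x):  # plays the role of thread_idx.x
            if row < size and col < size:
                # a is indexed by column, b by row
                output[row * size + col] = a[col] + b[row]
            else:
                # Without the guard this thread would read past the end of a or b.
                skipped += 1
    return output, skipped


# Assumed inputs (not stated in the text above).
a = [0.0, 1.0]
b = [0.0, 1.0]
output, skipped = broadcast_add_cpu(a, b, size=2)
print(output)   # [0.0, 1.0, 1.0, 2.0]
print(skipped)  # 5
```

Thread (0, 2) shows why the guard matters even when the output index looks harmless: its flat index 0 * 2 + 2 = 2 lies inside the 4-element buffer, yet a[2] does not exist, and the write would land on the element that belongs to thread (1, 0).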

Compare this with the LayoutTensor version to see how the abstraction simplifies broadcasting operations while maintaining the same underlying concepts.