Mojo🔥 GPU Puzzles

Overview

Implement a kernel that adds 10 to each position of matrix a and stores it in out.

Note: You have fewer threads per block than the size of a in both directions.

Key concepts

In this puzzle, you’ll learn about:

Working with 2D block and thread arrangements
Handling matrix data larger than block size
Converting between 2D and linear memory access

The key insight is understanding how to coordinate multiple blocks of threads to process a 2D matrix that’s larger than a single block’s dimensions.

Configuration

Matrix size: \(5 \times 5\) elements
2D blocks: Each block processes a \(3 \times 3\) region
Grid layout: Blocks arranged in \(2 \times 2\) grid
Total threads: \(36\) for \(25\) elements
Memory pattern: Row-major storage for 2D data
Coverage: Ensuring all matrix elements are processed

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)

View full file: problems/p07/p07.mojo

Tips

Calculate global indices: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
Add guard: if row < size and col < size
Inside guard: think about how to add 10 in row-major way!

Running the code

To test your solution, run the following command in your terminal:

uv run poe p07

pixi run p07

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0

This solution demonstrates key concepts of 2D block-based processing with raw memory:

2D thread indexing

Global row: block_dim.y * block_idx.y + thread_idx.y
Global col: block_dim.x * block_idx.x + thread_idx.x

Maps thread grid to matrix elements:

5×5 matrix with 3×3 blocks:

Block (0,0)         Block (1,0)
[(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
[(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
[(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]

Block (0,1)         Block (1,1)
[(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
[(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
[  *     *     *  ] [  *     *      *  ]

(* = thread exists but outside matrix bounds)

Memory layout

Row-major linear memory: index = row * size + col

Example for 5×5 matrix:

2D indices:    Linear memory:
(2,1) -> 11   [00 01 02 03 04]
              [05 06 07 08 09]
              [10 11 12 13 14]
              [15 16 17 18 19]
              [20 21 22 23 24]

Bounds checking
- Guard row < size and col < size handles:
  - Excess threads in partial blocks
  - Edge cases at matrix boundaries
  - 2×2 block grid with 3×3 threads each = 36 threads for 25 elements
Block coordination
- Each 3×3 block processes part of 5×5 matrix
- 2×2 grid of blocks ensures full coverage
- Overlapping threads handled by bounds check
- Efficient parallel processing across blocks

This pattern shows how to handle 2D data larger than block size while maintaining efficient memory access and thread coordination.