Overview

Implement a kernel that adds 10 to each position of matrix a and stores it in out.

Note: You have fewer threads per block than the size of a in both directions.

Key concepts

In this puzzle, you’ll learn about:

  • Working with 2D block and thread arrangements
  • Handling matrix data larger than block size
  • Converting between 2D and linear memory access

The key insight is understanding how to coordinate multiple blocks of threads to process a 2D matrix that’s larger than a single block’s dimensions.

Configuration

  • Matrix size: \(5 \times 5\) elements
  • 2D blocks: Each block processes a \(3 \times 3\) region
  • Grid layout: Blocks arranged in \(2 \times 2\) grid
  • Total threads: \(36\) for \(25\) elements
  • Memory pattern: Row-major storage for 2D data
  • Coverage: Ensuring all matrix elements are processed

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07.mojo

Tips
  1. Calculate global indices: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to add 10 in row-major way!

Running the code

To test your solution, run the following command in your terminal:

uv run poe p07
pixi run p07

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0


This solution demonstrates key concepts of 2D block-based processing with raw memory:

  1. 2D thread indexing

    • Global row: block_dim.y * block_idx.y + thread_idx.y
    • Global col: block_dim.x * block_idx.x + thread_idx.x
    • Maps thread grid to matrix elements:
      5×5 matrix with 3×3 blocks:
      
      Block (0,0)         Block (1,0)
      [(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
      [(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
      [(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]
      
      Block (0,1)         Block (1,1)
      [(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
      [(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
      [  *     *     *  ] [  *     *      *  ]
      
      (* = thread exists but outside matrix bounds)
  2. Memory layout

    • Row-major linear memory: index = row * size + col
    • Example for 5×5 matrix:
      2D indices:    Linear memory:
      (2,1) -> 11   [00 01 02 03 04]
                    [05 06 07 08 09]
                    [10 11 12 13 14]
                    [15 16 17 18 19]
                    [20 21 22 23 24]
      
  3. Bounds checking

    • Guard row < size and col < size handles:
      • Excess threads in partial blocks
      • Edge cases at matrix boundaries
      • 2×2 block grid with 3×3 threads each = 36 threads for 25 elements
  4. Block coordination

    • Each 3×3 block processes part of 5×5 matrix
    • 2×2 grid of blocks ensures full coverage
    • Overlapping threads handled by bounds check
    • Efficient parallel processing across blocks

This pattern shows how to handle 2D data larger than block size while maintaining efficient memory access and thread coordination.