Overview
Implement a kernel that adds 10 to each position of matrix a
and stores it in out
.
Note: You have fewer threads per block than the size of a
in both directions.
Key concepts
In this puzzle, you’ll learn about:
- Working with 2D block and thread arrangements
- Handling matrix data larger than block size
- Converting between 2D and linear memory access
The key insight is understanding how to coordinate multiple blocks of threads to process a 2D matrix that’s larger than a single block’s dimensions.
Configuration
- Matrix size: \(5 \times 5\) elements
- 2D blocks: Each block processes a \(3 \times 3\) region
- Grid layout: Blocks arranged in \(2 \times 2\) grid
- Total threads: \(36\) for \(25\) elements
- Memory pattern: Row-major storage for 2D data
- Coverage: Ensuring all matrix elements are processed
Code to complete
alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
fn add_10_blocks_2d(
out: UnsafePointer[Scalar[dtype]],
a: UnsafePointer[Scalar[dtype]],
size: Int,
):
row = block_dim.y * block_idx.y + thread_idx.y
col = block_dim.x * block_idx.x + thread_idx.x
# FILL ME IN (roughly 2 lines)
View full file: problems/p07/p07.mojo
Tips
- Calculate global indices:
row = block_dim.y * block_idx.y + thread_idx.y
,col = block_dim.x * block_idx.x + thread_idx.x
- Add guard:
if row < size and col < size
- Inside guard: think about how to add 10 in row-major way!
Running the code
To test your solution, run the following command in your terminal:
uv run poe p07
pixi run p07
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])
Solution
fn add_10_blocks_2d(
output: UnsafePointer[Scalar[dtype]],
a: UnsafePointer[Scalar[dtype]],
size: Int,
):
row = block_dim.y * block_idx.y + thread_idx.y
col = block_dim.x * block_idx.x + thread_idx.x
if row < size and col < size:
output[row * size + col] = a[row * size + col] + 10.0
This solution demonstrates key concepts of 2D block-based processing with raw memory:
-
2D thread indexing
- Global row:
block_dim.y * block_idx.y + thread_idx.y
- Global col:
block_dim.x * block_idx.x + thread_idx.x
- Maps thread grid to matrix elements:
(* = thread exists but outside matrix bounds)5×5 matrix with 3×3 blocks: Block (0,0) Block (1,0) [(0,0) (0,1) (0,2)] [(0,3) (0,4) * ] [(1,0) (1,1) (1,2)] [(1,3) (1,4) * ] [(2,0) (2,1) (2,2)] [(2,3) (2,4) * ] Block (0,1) Block (1,1) [(3,0) (3,1) (3,2)] [(3,3) (3,4) * ] [(4,0) (4,1) (4,2)] [(4,3) (4,4) * ] [ * * * ] [ * * * ]
- Global row:
-
Memory layout
- Row-major linear memory:
index = row * size + col
- Example for 5×5 matrix:
2D indices: Linear memory: (2,1) -> 11 [00 01 02 03 04] [05 06 07 08 09] [10 11 12 13 14] [15 16 17 18 19] [20 21 22 23 24]
- Row-major linear memory:
-
Bounds checking
- Guard
row < size and col < size
handles:- Excess threads in partial blocks
- Edge cases at matrix boundaries
- 2×2 block grid with 3×3 threads each = 36 threads for 25 elements
- Guard
-
Block coordination
- Each 3×3 block processes part of 5×5 matrix
- 2×2 grid of blocks ensures full coverage
- Overlapping threads handled by bounds check
- Efficient parallel processing across blocks
This pattern shows how to handle 2D data larger than block size while maintaining efficient memory access and thread coordination.