Overview
Implement a kernel that broadcast-adds vector a and vector b and stores the result in the 2D matrix output.
Note: You have more threads than positions.
Key concepts
In this puzzle, you’ll learn about:
- Broadcasting 1D vectors across different dimensions
- Using 2D thread indices for broadcast operations
- Handling boundary conditions in broadcast patterns
The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly.
- Broadcasting: Each element of a combines with each element of b
- Thread mapping: 2D thread grid \((3 \times 3)\) for \(2 \times 2\) output
- Vector access: Different access patterns for a and b
- Bounds checking: Guard against threads outside matrix dimensions
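To make the thread mapping and bounds check concrete, here is a minimal CPU-side sketch in plain Python (not Mojo, and not part of the puzzle code). It enumerates the nine threads of a 3×3 block and records which ones land on a cell of the 2×2 output. The helper name map_threads is hypothetical, and the row-major offset row * size + col is assumed to be the layout of the output buffer.

```python
# CPU-side emulation in plain Python (not Mojo): enumerate the threads of one
# 3x3 block and see which ones map to a cell of the 2x2 output.
SIZE = 2                    # output is SIZE x SIZE, as in the puzzle
THREADS_PER_BLOCK = (3, 3)  # (x, y) threads in the block, as in the puzzle


def map_threads(size, block_dim):
    """Hypothetical helper: {(row, col): flat offset or None} per thread."""
    mapping = {}
    for y in range(block_dim[1]):      # plays the role of thread_idx.y
        for x in range(block_dim[0]):  # plays the role of thread_idx.x
            row, col = y, x
            if row < size and col < size:
                mapping[(row, col)] = row * size + col  # assumed row-major offset
            else:
                mapping[(row, col)] = None              # thread has no cell
    return mapping


mapping = map_threads(SIZE, THREADS_PER_BLOCK)
active = {rc: off for rc, off in mapping.items() if off is not None}
idle = [rc for rc, off in mapping.items() if off is None]
print("active threads -> offsets:", active)
print("idle threads:", len(idle))
```

Under these assumptions, four threads own one output cell each and the remaining five have nothing to do, which is the situation the note about having more threads than positions describes.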
Code to complete
# Imports used by this excerpt
from gpu import thread_idx
from memory import UnsafePointer

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn broadcast_add(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)
View full file: problems/p05/p05.mojo
Tips
- Get 2D indices: row = thread_idx.y, col = thread_idx.x
- Add guard: if row < size and col < size
- Inside guard: think about how to broadcast values of a and b
Running the code
To test your solution, run one of the following commands in your terminal (the first if you use uv, the second if you use pixi):
uv run poe p05
pixi run p05
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])
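The two buffers above are printed flat, which makes it hard to see which matrix cells are wrong. The following sketch, in plain Python rather than Mojo, reshapes a flat buffer into rows and lists the (row, col) positions that differ. It assumes the 2×2 matrix is stored row by row; the helper names as_rows and mismatches are hypothetical and do not come from the puzzle's test harness.

```python
SIZE = 2
out = [0.0, 0.0, 0.0, 0.0]       # unsolved output, copied from the listing above
expected = [0.0, 1.0, 1.0, 2.0]  # expected buffer, copied from the listing above


def as_rows(flat, size):
    """Hypothetical helper: view a flat buffer as rows (assumes row-major order)."""
    return [flat[r * size:(r + 1) * size] for r in range(size)]


def mismatches(got, want, size):
    """Hypothetical helper: (row, col) positions where the buffers differ."""
    return [divmod(i, size) for i, (g, w) in enumerate(zip(got, want)) if g != w]


expected_rows = as_rows(expected, SIZE)
wrong_cells = mismatches(out, expected, SIZE)
print(expected_rows)  # [[0.0, 1.0], [1.0, 2.0]]
print(wrong_cells)    # [(0, 1), (1, 0), (1, 1)]
```

Cell (0, 0) matches only because its expected value happens to equal the zero-initialised buffer, so a passing cell here does not mean the kernel wrote anything.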
Solution
fn broadcast_add(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[col] + b[row]
This solution demonstrates fundamental GPU broadcasting concepts without the LayoutTensor abstraction:
- Thread to matrix mapping
  - Uses thread_idx.y for row access and thread_idx.x for column access
  - Direct mapping from 2D thread grid to output matrix elements
  - Handles excess threads (3×3 grid) for 2×2 output matrix
- Broadcasting mechanics
  - Vector a acts as a row vector: a[col] depends only on the column, so every row reuses the same values of a
  - Vector b acts as a column vector: b[row] depends only on the row, so every column reuses the same values of b
  - Output combines both vectors through addition

    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]

- Bounds checking
  - Single guard condition row < size and col < size handles both dimensions
  - Prevents out-of-bounds access for both input vectors and output matrix
  - Required due to 3×3 thread grid being larger than 2×2 data
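The three points above can be checked with a small CPU-side emulation. This is a sketch in plain Python, not Mojo: it runs the body of the kernel once per thread of a 3×3 block, one thread after another, which is acceptable here because no two threads write the same cell. The function names broadcast_add_thread and launch are hypothetical, and the inputs a = b = [0, 1] are an assumption chosen to be consistent with the expected output shown earlier.

```python
def broadcast_add_thread(output, a, b, size, row, col, guard=True):
    """Body of the kernel for one emulated thread (hypothetical helper)."""
    if guard and not (row < size and col < size):
        return  # excess thread: nothing to read or write
    output[row * size + col] = a[col] + b[row]


def launch(a, b, size, block_dim, guard=True):
    """Run every thread of one block sequentially (hypothetical helper)."""
    output = [0.0] * (size * size)
    for y in range(block_dim[1]):      # plays the role of thread_idx.y
        for x in range(block_dim[0]):  # plays the role of thread_idx.x
            broadcast_add_thread(output, a, b, size, row=y, col=x, guard=guard)
    return output


a = [0.0, 1.0]  # assumed inputs, consistent with the expected output
b = [0.0, 1.0]

result = launch(a, b, size=2, block_dim=(3, 3))
print(result)  # [0.0, 1.0, 1.0, 2.0]

# Without the guard, the thread at row 0, col 2 indexes past the end of a.
try:
    launch(a, b, size=2, block_dim=(3, 3), guard=False)
    unguarded = "no error"
except IndexError:
    unguarded = "IndexError"
print(unguarded)  # IndexError
```

Python lists are bounds-checked, so the unguarded run fails loudly. A raw pointer such as the UnsafePointer in the kernel performs no such check, so on the GPU the same mistake would read or write memory outside the buffers instead of raising an error.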
Compare this with the LayoutTensor version to see how the abstraction simplifies broadcasting operations while maintaining the same underlying concepts.