LayoutTensor Version
Overview
Implement a kernel that broadcast-adds 1D `LayoutTensor` `a` and 1D `LayoutTensor` `b`, storing the result in 2D `LayoutTensor` `out`.
Note: You have more threads than positions.
Key concepts
In this puzzle, you'll learn about:
- Using `LayoutTensor` for broadcast operations
- Working with different tensor shapes
- Handling 2D indexing with `LayoutTensor`

The key insight is that `LayoutTensor` allows natural broadcasting through different tensor shapes, from \((1, n)\) and \((n, 1)\) to \((n, n)\), while still requiring bounds checking.
- Tensor shapes: Input vectors have shapes \((1, n)\) and \((n, 1)\)
- Broadcasting: Output combines both dimensions to \((n, n)\)
- Guard condition: Still need bounds checking for the output size
- Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)
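To build intuition for the shape combination before diving into the Mojo kernel, here is a plain-Python sketch (hypothetical, not part of the puzzle files) of how a \((1, n)\) row vector and an \((n, 1)\) column vector combine into an \((n, n)\) matrix:

```python
# Plain-Python sketch of the broadcasting idea (not Mojo):
# a row vector of shape (1, n) plus a column vector of shape (n, 1)
# produces an (n, n) matrix, one sum per (row, col) pair.
n = 2
a = [[0.0, 1.0]]           # shape (1, n): one row
b = [[0.0], [1.0]]         # shape (n, 1): one column

out = [[a[0][col] + b[row][0] for col in range(n)] for row in range(n)]
print(out)  # [[0.0, 1.0], [1.0, 2.0]]
```

Note that every output element reads `a` only by column and `b` only by row, which is exactly what makes the broadcast natural.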
Code to complete

```mojo
alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(1, SIZE)
alias b_layout = Layout.row_major(SIZE, 1)


fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    b: LayoutTensor[mut=False, dtype, b_layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)
```

View full file: problems/p05/p05_layout_tensor.mojo
Tips
- Get 2D indices: `row = thread_idx.y`, `col = thread_idx.x`
- Add guard: `if row < size and col < size`
- Inside the guard: think about how to broadcast the values of `a` and `b` as LayoutTensors
Running the code
To test your solution, run one of the following commands in your terminal, depending on your environment:

```shell
uv run poe p05_layout_tensor
```

```shell
pixi run p05_layout_tensor
```

Your output will look like this if the puzzle isn't solved yet:

```
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])
```
Solution

```mojo
fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    b: LayoutTensor[mut=False, dtype, b_layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row, col] = a[0, col] + b[row, 0]
```
This solution demonstrates key concepts of LayoutTensor broadcasting and GPU thread mapping:

- Thread to matrix mapping
  - Uses `thread_idx.y` for row access and `thread_idx.x` for column access
  - Natural 2D indexing matches the output matrix structure
  - Excess threads (3×3 grid) are handled by bounds checking

- Broadcasting mechanics
  - Input `a` has shape \((1, n)\): `a[0, col]` broadcasts across rows
  - Input `b` has shape \((n, 1)\): `b[row, 0]` broadcasts across columns
  - Output has shape \((n, n)\): each element is the sum of the corresponding broadcasts

  ```
  [ a0  a1 ] + [ b0 ]  =  [ a0+b0  a1+b0 ]
               [ b1 ]     [ a0+b1  a1+b1 ]
  ```

- Bounds checking
  - The guard condition `row < size and col < size` prevents out-of-bounds access
  - Handles both matrix bounds and excess threads efficiently
  - No separate checks for `a` and `b` are needed, thanks to broadcasting
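The thread mapping and guard behavior can be emulated outside the GPU. Here is a hypothetical plain-Python sketch (not Mojo, and not part of the puzzle files) that loops over every `(thread_idx.y, thread_idx.x)` pair the 3×3 launch would produce and applies the same guard:

```python
# Plain-Python emulation of the kernel's thread grid and guard (not Mojo):
SIZE = 2
THREADS_PER_BLOCK = (3, 3)   # a 3x3 thread grid over a 2x2 output

a = [[0.0, 1.0]]             # shape (1, SIZE)
b = [[0.0], [1.0]]           # shape (SIZE, 1)
out = [[0.0] * SIZE for _ in range(SIZE)]

# Each loop iteration plays the role of one GPU thread.
for row in range(THREADS_PER_BLOCK[1]):      # thread_idx.y
    for col in range(THREADS_PER_BLOCK[0]):  # thread_idx.x
        if row < SIZE and col < SIZE:        # guard: the 5 excess threads do nothing
            out[row][col] = a[0][col] + b[row][0]

print(out)  # [[0.0, 1.0], [1.0, 2.0]]
```

Of the 9 emulated threads, only 4 pass the guard and write an element; the rest exit without touching memory, which mirrors how the real kernel stays safe with more threads than positions.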
This pattern forms the foundation for more complex tensor operations we'll explore in later puzzles.