LayoutTensor Version
Overview
Implement a kernel that adds 10 to each position of 2D LayoutTensor `a` and stores it in 2D LayoutTensor `out`.
Note: You have fewer threads per block than the size of `a` in both directions.
Key concepts
In this puzzle, you'll learn about:
- Using `LayoutTensor` with multiple blocks
- Handling large matrices with 2D block organization
- Combining block indexing with `LayoutTensor` access
The key insight is that `LayoutTensor` simplifies 2D indexing while still requiring proper block coordination for large matrices.
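To make that point concrete, here is a minimal sketch (not the puzzle's code) contrasting a hypothetical raw-pointer kernel, which must compute the flat offset `row * size + col` by hand, with the LayoutTensor form used in this puzzle. The names `add_10_raw_sketch` and `add_10_lt_sketch` are illustrative only:

```mojo
from gpu import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from memory import UnsafePointer

alias dtype = DType.float32


# Hypothetical raw-pointer kernel: the flat offset is computed by hand.
fn add_10_raw_sketch(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0


# LayoutTensor kernel: the layout maps (row, col) to the right offset,
# but the block/thread index math and the bounds guard stay the same.
fn add_10_lt_sketch[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0
```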
Configuration
- Matrix size: \(5 \times 5\) elements
- Layout handling: `LayoutTensor` manages row-major organization
- Block coordination: Multiple blocks cover the full matrix
- 2D indexing: Natural \((i,j)\) access with bounds checking
- Total threads: \(36\) for \(25\) elements
- Thread mapping: Each thread processes one matrix element
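The \(36\)-thread figure comes directly from this configuration: \(2 \times 2\) blocks of \(3 \times 3\) threads give \(4 \cdot 9 = 36\) threads for \(5 \cdot 5 = 25\) elements. As a small illustrative sketch (not part of the puzzle files), the grid size can be derived with ceiling division from Mojo's `math` module:

```mojo
from math import ceildiv

alias SIZE = 5
alias THREADS_PER_BLOCK = (3, 3)


def main():
    # One block covers a 3×3 tile, so ceil(5 / 3) = 2 blocks per axis.
    blocks_x = ceildiv(SIZE, THREADS_PER_BLOCK[0])
    blocks_y = ceildiv(SIZE, THREADS_PER_BLOCK[1])
    print("blocks per grid:", blocks_x, "x", blocks_y)

    # 2×2 blocks of 3×3 threads = 36 launched threads for 25 elements;
    # the 11 surplus threads are masked off by the kernel's bounds check.
    total = blocks_x * blocks_y * THREADS_PER_BLOCK[0] * THREADS_PER_BLOCK[1]
    print("total threads:", total, "for", SIZE * SIZE, "elements")
```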
Code to complete
```mojo
alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, SIZE)


fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)
```
View full file: problems/p07/p07_layout_tensor.mojo
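For context, here is a rough sketch of how a host program might allocate buffers, wrap them in LayoutTensors, and launch this kernel with the \(2 \times 2\) grid of \(3 \times 3\) blocks. It follows the `DeviceContext` pattern used throughout these puzzles, but the buffer setup and input initialization shown here are assumptions; the actual driver lives in problems/p07/p07_layout_tensor.mojo and may differ:

```mojo
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

# Assumes SIZE, BLOCKS_PER_GRID, THREADS_PER_BLOCK, dtype, a_layout,
# out_layout, and a completed add_10_blocks_2d are defined as above.


def main():
    with DeviceContext() as ctx:
        # Device buffers for the 5×5 input and output (input initialization
        # is omitted here; see the problem file for the real values).
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)

        # Wrap the raw device pointers in 5×5 row-major LayoutTensors.
        a_tensor = LayoutTensor[mut=False, dtype, a_layout](a_buf.unsafe_ptr())
        out_tensor = LayoutTensor[mut=True, dtype, out_layout](
            out_buf.unsafe_ptr()
        )

        # Launch a 2×2 grid of 3×3 blocks: 36 threads cover 25 elements.
        ctx.enqueue_function[add_10_blocks_2d[out_layout, a_layout]](
            out_tensor,
            a_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()
```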
Tips
- Calculate global indices: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x`
- Add guard: `if row < size and col < size`
- Inside the guard: think about how to add 10 to the 2D LayoutTensor
Running the code
To test your solution, run one of the following commands in your terminal:

```bash
uv run poe p07_layout_tensor
```

or, with pixi:

```bash
pixi run p07_layout_tensor
```
Your output will look like this if the puzzle isn't solved yet:

```
out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])
```
Solution
```mojo
fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0
```
This solution demonstrates how LayoutTensor simplifies 2D block-based processing:
- 2D thread indexing
  - Global row: `block_dim.y * block_idx.y + thread_idx.y`
  - Global col: `block_dim.x * block_idx.x + thread_idx.x`
  - Maps the thread grid to tensor elements (a worked example follows this list):

    ```
    5×5 tensor with 3×3 blocks:
    (* = thread exists but outside tensor bounds)

    Block (0,0)          Block (1,0)
    [(0,0) (0,1) (0,2)]  [(0,3) (0,4)   *  ]
    [(1,0) (1,1) (1,2)]  [(1,3) (1,4)   *  ]
    [(2,0) (2,1) (2,2)]  [(2,3) (2,4)   *  ]

    Block (0,1)          Block (1,1)
    [(3,0) (3,1) (3,2)]  [(3,3) (3,4)   *  ]
    [(4,0) (4,1) (4,2)]  [(4,3) (4,4)   *  ]
    [  *     *     *  ]  [  *     *     *  ]
    ```
- LayoutTensor benefits
  - Natural 2D indexing: `tensor[row, col]` instead of manual offset calculation
  - Automatic memory layout optimization
  - Example access pattern:

    ```
    Raw memory:           LayoutTensor:
    row * size + col      tensor[row, col]
    (2,1) -> 11           (2,1) -> same element
    ```
- Bounds checking
  - The guard `row < size and col < size` handles:
    - Excess threads in partial blocks
    - Edge cases at tensor boundaries
    - Automatic memory layout handling by LayoutTensor
    - 36 threads (2×2 blocks of 3×3) for 25 elements
- Block coordination
  - Each 3×3 block processes part of the 5×5 tensor
  - LayoutTensor handles:
    - Memory layout optimization
    - Efficient access patterns
    - Block boundary coordination
    - Cache-friendly data access
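As a concrete check of the thread-to-element mapping above: thread \((x, y) = (1, 2)\) in block \((1, 0)\) computes row \(= 3 \cdot 0 + 2 = 2\) and col \(= 3 \cdot 1 + 1 = 4\), so it updates element \((2, 4)\), exactly where the block diagram places it. Thread \((2, 2)\) in block \((1, 1)\) computes row \(=\) col \(= 3 \cdot 1 + 2 = 5\), which fails the `row < size and col < size` guard, so it writes nothing; this is how the \(36 - 25 = 11\) surplus threads are kept from touching memory outside the \(5 \times 5\) tensor.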
This pattern shows how LayoutTensor simplifies 2D block processing while maintaining optimal memory access patterns and thread coordination.