LayoutTensor Version
Overview
Implement a kernel that adds 10 to each position of 2D LayoutTensor `a` and stores it in 2D LayoutTensor `out`.
Note: You have fewer threads per block than the size of `a` in both directions.
Key concepts
In this puzzle, you'll learn about:
- Using `LayoutTensor` with multiple blocks
- Handling large matrices with 2D block organization
- Combining block indexing with `LayoutTensor` access
The key insight is that `LayoutTensor` simplifies 2D indexing while still requiring proper block coordination for large matrices.
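To make that point concrete, here is a minimal sketch (not the puzzle's code) contrasting a hypothetical raw-pointer kernel, which must compute the flat offset `row * size + col` by hand, with the LayoutTensor form used in this puzzle. The names `add_10_raw_sketch` and `add_10_lt_sketch` are illustrative only:

```mojo
from gpu import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from memory import UnsafePointer

alias dtype = DType.float32


# Hypothetical raw-pointer kernel: the flat offset is computed by hand.
fn add_10_raw_sketch(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0


# LayoutTensor kernel: the layout maps (row, col) to the right offset,
# but the block/thread index math and the bounds guard stay the same.
fn add_10_lt_sketch[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0
```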
Configuration
- Matrix size: \(5 \times 5\) elements
- Layout handling: `LayoutTensor` manages row-major organization
- Block coordination: Multiple blocks cover the full matrix
- 2D indexing: Natural \((i,j)\) access with bounds checking
- Total threads: \(36\) for \(25\) elements
- Thread mapping: Each thread processes one matrix element
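The \(36\)-thread figure comes directly from this configuration: \(2 \times 2\) blocks of \(3 \times 3\) threads give \(4 \cdot 9 = 36\) threads for \(5 \cdot 5 = 25\) elements. As a small illustrative sketch (not part of the puzzle files), the grid size can be derived with ceiling division from Mojo's `math` module:

```mojo
from math import ceildiv

alias SIZE = 5
alias THREADS_PER_BLOCK = (3, 3)


def main():
    # One block covers a 3×3 tile, so ceil(5 / 3) = 2 blocks per axis.
    blocks_x = ceildiv(SIZE, THREADS_PER_BLOCK[0])
    blocks_y = ceildiv(SIZE, THREADS_PER_BLOCK[1])
    print("blocks per grid:", blocks_x, "x", blocks_y)

    # 2×2 blocks of 3×3 threads = 36 launched threads for 25 elements;
    # the 11 surplus threads are masked off by the kernel's bounds check.
    total = blocks_x * blocks_y * THREADS_PER_BLOCK[0] * THREADS_PER_BLOCK[1]
    print("total threads:", total, "for", SIZE * SIZE, "elements")
```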
Code to complete
```mojo
alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, SIZE)


fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)
```
View full file: problems/p07/p07_layout_tensor.mojo
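For context, here is a rough sketch of how a host program might allocate buffers, wrap them in LayoutTensors, and launch this kernel with the \(2 \times 2\) grid of \(3 \times 3\) blocks. It follows the `DeviceContext` pattern used throughout these puzzles, but the buffer setup and input initialization shown here are assumptions; the actual driver lives in problems/p07/p07_layout_tensor.mojo and may differ:

```mojo
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

# Assumes SIZE, BLOCKS_PER_GRID, THREADS_PER_BLOCK, dtype, a_layout,
# out_layout, and a completed add_10_blocks_2d are defined as above.


def main():
    with DeviceContext() as ctx:
        # Device buffers for the 5×5 input and output (input initialization
        # is omitted here; see the problem file for the real values).
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)

        # Wrap the raw device pointers in 5×5 row-major LayoutTensors.
        a_tensor = LayoutTensor[mut=False, dtype, a_layout](a_buf.unsafe_ptr())
        out_tensor = LayoutTensor[mut=True, dtype, out_layout](
            out_buf.unsafe_ptr()
        )

        # Launch a 2×2 grid of 3×3 blocks: 36 threads cover 25 elements.
        ctx.enqueue_function[add_10_blocks_2d[out_layout, a_layout]](
            out_tensor,
            a_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()
```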
Tips
- Calculate global indices: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x`
- Add guard: `if row < size and col < size`
- Inside the guard: think about how to add 10 to the 2D LayoutTensor
Running the code
To test your solution, run one of the following commands in your terminal:

```bash
uv run poe p07_layout_tensor
```

or, with pixi:

```bash
pixi run p07_layout_tensor
```
Your output will look like this if the puzzle isn't solved yet:

```
out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])
```
Solution
```mojo
fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0
```
This solution demonstrates how LayoutTensor simplifies 2D block-based processing:
- 2D thread indexing
  - Global row: `block_dim.y * block_idx.y + thread_idx.y`
  - Global col: `block_dim.x * block_idx.x + thread_idx.x`
  - Maps the thread grid to tensor elements (a worked example follows this list):

    ```
    5×5 tensor with 3×3 blocks:
    (* = thread exists but outside tensor bounds)

    Block (0,0)          Block (1,0)
    [(0,0) (0,1) (0,2)]  [(0,3) (0,4)   *  ]
    [(1,0) (1,1) (1,2)]  [(1,3) (1,4)   *  ]
    [(2,0) (2,1) (2,2)]  [(2,3) (2,4)   *  ]

    Block (0,1)          Block (1,1)
    [(3,0) (3,1) (3,2)]  [(3,3) (3,4)   *  ]
    [(4,0) (4,1) (4,2)]  [(4,3) (4,4)   *  ]
    [  *     *     *  ]  [  *     *     *  ]
    ```
- LayoutTensor benefits
  - Natural 2D indexing: `tensor[row, col]` instead of manual offset calculation
  - Automatic memory layout optimization
  - Example access pattern:

    ```
    Raw memory:           LayoutTensor:
    row * size + col      tensor[row, col]
    (2,1) -> 11           (2,1) -> same element
    ```
- Bounds checking
  - The guard `row < size and col < size` handles:
    - Excess threads in partial blocks
    - Edge cases at tensor boundaries
    - Automatic memory layout handling by LayoutTensor
    - 36 threads (2×2 blocks of 3×3) for 25 elements
- Block coordination
  - Each 3×3 block processes part of the 5×5 tensor
  - LayoutTensor handles:
    - Memory layout optimization
    - Efficient access patterns
    - Block boundary coordination
    - Cache-friendly data access
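As a concrete check of the thread-to-element mapping above: thread \((x, y) = (1, 2)\) in block \((1, 0)\) computes row \(= 3 \cdot 0 + 2 = 2\) and col \(= 3 \cdot 1 + 1 = 4\), so it updates element \((2, 4)\), exactly where the block diagram places it. Thread \((2, 2)\) in block \((1, 1)\) computes row \(=\) col \(= 3 \cdot 1 + 2 = 5\), which fails the `row < size and col < size` guard, so it writes nothing; this is how the \(36 - 25 = 11\) surplus threads are kept from touching memory outside the \(5 \times 5\) tensor.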
This pattern shows how LayoutTensor simplifies 2D block processing while maintaining optimal memory access patterns and thread coordination.