LayoutTensor Version

Overview

Implement a kernel that adds 10 to each position of the 2D LayoutTensor a and stores the result in the 2D LayoutTensor out.

Note: You have fewer threads per block than the size of a in both directions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor with multiple blocks
  • Handling large matrices with 2D block organization
  • Combining block indexing with LayoutTensor access

The key insight is that LayoutTensor simplifies 2D indexing while still requiring proper block coordination for large matrices.

Configuration

  • Matrix size: \(5 \times 5\) elements
  • Layout handling: LayoutTensor manages row-major organization
  • Block coordination: Multiple blocks cover the full matrix
  • 2D indexing: Natural \((i,j)\) access with bounds checking
  • Total threads: \(36\) for \(25\) elements
  • Thread mapping: Each thread processes one matrix element
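
The 2×2 grid follows from rounding the matrix size up to a whole number of blocks. A minimal sketch of that arithmetic, using ceildiv from Mojo's math module (the helper function is illustrative, not part of the puzzle file):

from math import ceildiv

# ceildiv(5, 3) = 2 blocks per dimension, so the grid is (2, 2):
# 2 * 2 blocks of 3 * 3 threads = 36 threads for 5 * 5 = 25 elements.
fn blocks_per_dim(size: Int, threads_per_block: Int) -> Int:
    return ceildiv(size, threads_per_block)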

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, SIZE)


fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07_layout_tensor.mojo
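
That file also contains the host-side setup that allocates device buffers, wraps them in LayoutTensor views, and launches the kernel with the grid and block shapes defined above. The snippet below is only a rough sketch of that setup; the buffer names, fill values, and exact call structure are assumptions rather than the actual file contents:

from gpu.host import DeviceContext


def main():
    with DeviceContext() as ctx:
        # Device buffers for the 5x5 input and output (names are illustrative).
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(1)

        # Wrap the raw device memory in row-major LayoutTensor views.
        out_tensor = LayoutTensor[mut=True, dtype, out_layout](out_buf.unsafe_ptr())
        a_tensor = LayoutTensor[mut=False, dtype, a_layout](a_buf.unsafe_ptr())

        # Launch a 2x2 grid of 3x3 blocks: 36 threads for 25 elements.
        ctx.enqueue_function[add_10_blocks_2d[out_layout, a_layout]](
            out_tensor,
            a_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()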

Tips
  1. Calculate global indices: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to add 10 to 2D LayoutTensor

Running the code

To test your solution, run one of the following commands in your terminal, depending on whether you use uv or pixi:

uv run poe p07_layout_tensor
pixi run p07_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0


This solution demonstrates how LayoutTensor simplifies 2D block-based processing:

  1. 2D thread indexing

    • Global row: block_dim.y * block_idx.y + thread_idx.y
    • Global col: block_dim.x * block_idx.x + thread_idx.x
    • Maps thread grid to tensor elements:
      5×5 tensor with 3×3 blocks:
      
      Block (0,0)         Block (1,0)
      [(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
      [(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
      [(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]
      
      Block (0,1)         Block (1,1)
      [(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
      [(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
      [  *     *     *  ] [  *     *      *  ]
      
      (* = thread exists but is outside tensor bounds; a worked trace of the index math follows this list)
  2. LayoutTensor benefits

    • Natural 2D indexing: tensor[row, col] instead of manual offset calculation
    • Automatic handling of the row-major layout and its offset arithmetic
    • Example access pattern (compare the raw-pointer sketch at the end of this section):
      Raw memory:           LayoutTensor:
      row * size + col      tensor[row, col]
      (2,1) -> offset 11    (2,1) -> same element

  3. Bounds checking

    • Guard row < size and col < size handles:
      • Excess threads in partial blocks: 36 threads (2×2 blocks of 3×3) are launched for only 25 elements
      • Edge cases at the right and bottom tensor boundaries
    • The guard is still required with LayoutTensor: the layout automates addressing, not which threads map to valid elements (see the worked trace after this list)
  4. Block coordination

    • Each 3×3 block processes its own tile of the 5×5 tensor; together, the 2×2 grid and the bounds guard cover every element exactly once
    • LayoutTensor handles the row-major addressing behind tensor[row, col], so accesses within a block stay contiguous and cache-friendly
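
To make the index math and the guard concrete, here is a worked trace of two threads from block (block_idx.x=1, block_idx.y=0), using the 3×3 block shape above (plain arithmetic, no new API):

# Thread (thread_idx.x=1, thread_idx.y=2):
#   row = block_dim.y * block_idx.y + thread_idx.y = 3 * 0 + 2 = 2
#   col = block_dim.x * block_idx.x + thread_idx.x = 3 * 1 + 1 = 4
#   (2, 4) is inside the 5x5 tensor, so output[2, 4] = a[2, 4] + 10.0

# Thread (thread_idx.x=2, thread_idx.y=2):
#   row = 3 * 0 + 2 = 2
#   col = 3 * 1 + 2 = 5
#   col == size, so the guard skips this thread (one of the "*" cells in the diagram)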

This pattern shows how LayoutTensor simplifies 2D block-based processing: the layout takes care of memory addressing, while the grid configuration and the bounds guard take care of thread coordination.
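
For comparison, here is a hypothetical raw-pointer version of the same kernel that has to do the row-major offset arithmetic by hand. It is not part of the puzzle files; the function name and UnsafePointer signature are assumptions for illustration only:

from gpu import block_dim, block_idx, thread_idx
from memory import UnsafePointer


# Hypothetical variant for illustration only: same index math as the solution,
# but with manual row-major offsets instead of tensor[row, col].
fn add_10_blocks_2d_raw(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        # This is the offset arithmetic that LayoutTensor performs for us.
        output[row * size + col] = a[row * size + col] + 10.0

The row * size + col expression is exactly what the layout hides behind tensor[row, col]; getting it wrong silently writes into neighboring rows, which is the class of bug the LayoutTensor version avoids.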