LayoutTensor Version

Overview

Implement a kernel that broadcast-adds vector a and vector b and stores the result in the 2D LayoutTensor output.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for broadcast operations
  • Working with different tensor shapes
  • Handling 2D indexing with LayoutTensor

The key insight is that LayoutTensor enables natural broadcasting through different tensor shapes, combining \((1, n)\) and \((n, 1)\) into \((n, n)\), while still requiring bounds checking.

  • Tensor shapes: Input vectors have shapes \((1, n)\) and \((n, 1)\)
  • Broadcasting: Output combines both dimensions to \((n,n)\)
  • Guard condition: Still need bounds checking for output size
  • Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)
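
In index form, the broadcast amounts to a single elementwise rule: each output element pairs a column-indexed element of a with a row-indexed element of b,

\[
\text{output}[i, j] = a[0, j] + b[i, 0], \qquad 0 \le i, j < n.
\]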

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(1, SIZE)
alias b_layout = Layout.row_major(SIZE, 1)


fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    b: LayoutTensor[mut=False, dtype, b_layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p05/p05_layout_tensor.mojo

Tips
  1. Get 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to broadcast values of a and b as LayoutTensors

Running the code

To test your solution, run the following command in your terminal:

uv run poe p05_layout_tensor

or, if you use pixi:

pixi run p05_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])

Solution

fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    b: LayoutTensor[mut=False, dtype, b_layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # Guard: the 3x3 block launches more threads than the 2x2 output has elements.
    if row < size and col < size:
        # a has shape (1, n), indexed by column; b has shape (n, 1), indexed by row.
        output[row, col] = a[0, col] + b[row, 0]


This solution demonstrates key concepts of LayoutTensor broadcasting and GPU thread mapping:

  1. Thread to matrix mapping

    • Uses thread_idx.y for row access and thread_idx.x for column access
    • Natural 2D indexing matches the output matrix structure
    • Excess threads (3×3 grid) are handled by bounds checking
  2. Broadcasting mechanics

    • Input a has shape (1,n): a[0,col] broadcasts across rows
    • Input b has shape (n,1): b[row,0] broadcasts across columns
    • Output has shape (n,n): each element is the sum of the corresponding broadcasts (a worked numeric example follows this list)
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. Bounds checking

    • The guard condition row < size and col < size prevents out-of-bounds access
    • One check covers both the matrix bounds and the excess threads
    • No separate checks for a and b are needed: col < size already bounds a[0, col], and row < size bounds b[row, 0]
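
Concretely, the expected output shown earlier is consistent with inputs a = [0, 1] and b = [0, 1] (an assumption suggested by the expected HostBuffer: the harness appears to fill each vector with its indices):

out[0, 0] = a[0] + b[0] = 0 + 0 = 0
out[0, 1] = a[1] + b[0] = 1 + 0 = 1
out[1, 0] = a[0] + b[1] = 0 + 1 = 1
out[1, 1] = a[1] + b[1] = 1 + 1 = 2

which is exactly [0.0, 1.0, 1.0, 2.0] in row-major order.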

This pattern forms the foundation for more complex tensor operations we’ll explore in later puzzles.
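
For context, a minimal host-side harness for this kernel could look like the sketch below. It follows the standard DeviceContext flow; it is an illustration, not the exact contents of problems/p05/p05_layout_tensor.mojo, and the buffer setup there may differ.

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor


def main():
    with DeviceContext() as ctx:
        # Allocate zero-initialized device buffers.
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        b_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)

        # Fill a and b with 0, 1, ... from the host (assumed input pattern).
        with a_buf.map_to_host() as a_host, b_buf.map_to_host() as b_host:
            for i in range(SIZE):
                a_host[i] = i
                b_host[i] = i

        # View the flat buffers through the broadcast-friendly layouts.
        out_tensor = LayoutTensor[mut=True, dtype, out_layout](out_buf.unsafe_ptr())
        a_tensor = LayoutTensor[mut=False, dtype, a_layout](a_buf.unsafe_ptr())
        b_tensor = LayoutTensor[mut=False, dtype, b_layout](b_buf.unsafe_ptr())

        # One 3x3 block: 9 threads for 4 elements; the kernel's guard
        # handles the 5 excess threads.
        ctx.enqueue_function[broadcast_add[out_layout, a_layout, b_layout]](
            out_tensor,
            a_tensor,
            b_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()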