Overview

Implement a kernel that broadcast adds vector a and vector b and stores it in 2D matrix out.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Broadcasting 1D vectors across different dimensions
  • Using 2D thread indices for broadcast operations
  • Handling boundary conditions in broadcast patterns

The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly.

  • Broadcasting: Each element of a combines with each element of b
  • Thread mapping: 2D thread grid \((3 \times 3)\) for \(2 \times 2\) output
  • Vector access: Different access patterns for a and b
  • Bounds checking: Guard against threads outside matrix dimensions

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn broadcast_add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p05/p05.mojo

Tips
  1. Get 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to broadcast values of a and b

Running the code

To test your solution, run the following command in your terminal:

uv run poe p05
pixi run p05

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])

Solution

fn broadcast_add(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[col] + b[row]


This solution demonstrates fundamental GPU broadcasting concepts without LayoutTensor abstraction:

  1. Thread to matrix mapping

    • Uses thread_idx.y for row access and thread_idx.x for column access
    • Direct mapping from 2D thread grid to output matrix elements
    • Handles excess threads (3×3 grid) for 2×2 output matrix
  2. Broadcasting mechanics

    • Vector a broadcasts horizontally: same a[col] used across each row
    • Vector b broadcasts vertically: same b[row] used across each column
    • Output combines both vectors through addition
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. Bounds checking

    • Single guard condition row < size and col < size handles both dimensions
    • Prevents out-of-bounds access for both input vectors and output matrix
    • Required due to 3×3 thread grid being larger than 2×2 data

Compare this with the LayoutTensor version to see how the abstraction simplifies broadcasting operations while maintaining the same underlying concepts.