Overview

Implement a kernel that adds 10 to each position of a 1D LayoutTensor a and stores the result in a 1D LayoutTensor output.

Note: You have fewer threads per block than the size of a.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor’s shared memory features
  • Thread synchronization with shared memory
  • Block-local data management with tensor builder

The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Key differences from raw approach

  1. Memory allocation: We will use the LayoutTensorBuild tensor builder instead of stack_allocation (an annotated reading of the builder chain follows this list)

    # Raw approach
    shared = stack_allocation[TPB, Scalar[dtype]]()
    
    # LayoutTensor approach
    shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
    
  2. Memory access: Same syntax

    # Raw approach
    shared[local_i] = a[global_i]
    
    # LayoutTensor approach
    shared[local_i] = a[global_i]
    
  3. Safety features:

    • Type safety
    • Layout management
    • Memory alignment handling

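For reference, here is the builder chain from difference 1 read one call at a time. The comments describe what each step contributes; the kernel code below uses the shorter name tb for the builder, and the chain reads the same way there.

# Same allocation as in the kernel body, split so each step is visible.
# This line only makes sense on the device, inside a GPU kernel.
shared = (
    tb[dtype]()        # builder parameterized on the element type
    .row_major[TPB]()  # 1D row-major layout of TPB elements
    .shared()          # place the allocation in block-shared memory
    .alloc()           # allocate and return the block-local tensor
)
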
Note: LayoutTensor handles memory layout, but you still need to manage thread synchronization with barrier() when using shared memory.

Educational Note: In this specific puzzle, the barrier() isn’t strictly necessary since each thread only accesses its own shared memory location. However, it’s included to teach proper shared memory synchronization patterns for more complex scenarios where threads need to coordinate access to shared data.
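
To see when the barrier genuinely matters, consider a variant in which each thread also reads a shared slot written by a different thread. The kernel below is only an illustrative sketch, not part of the puzzle; it assumes the imports commonly used in these puzzle files (gpu for thread indices and barrier, layout for LayoutTensor, and the tb tensor builder) and a hypothetical add_right_neighbor kernel name.

from gpu import barrier, block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 4
alias dtype = DType.float32


fn add_right_neighbor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Hypothetical kernel: each thread needs its right neighbor's value,
    # i.e. a shared slot written by a different thread.
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Required here: without it, thread i could read shared[local_i + 1]
    # before thread i + 1 has finished writing it.
    barrier()

    if global_i < size and local_i + 1 < TPB:
        output[global_i] = shared[local_i] + shared[local_i + 1]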

Code to complete

alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Synchronize with barrier() (educational - not strictly needed here)
  4. Process data using shared memory indices
  5. Guard against out-of-bounds access

Running the code

To test your solution, run the following command in your terminal:

uv run poe p08_layout_tensor

or, if you are using pixi:

pixi run p08_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Note: the barrier isn't strictly necessary here since each thread only
    # accesses its own shared memory location. It's kept to demonstrate the
    # synchronization pattern needed when threads coordinate access to shared data.
    barrier()

    if global_i < size:
        output[global_i] = shared[local_i] + 10


This solution demonstrates how LayoutTensor simplifies shared memory usage while maintaining performance:

  1. Memory hierarchy with LayoutTensor

    • Global tensors: a and output (slow, visible to all blocks)
    • Shared tensor: shared (fast, thread-block local)
    • Example for 8 elements with 4 threads per block:
      Global tensor a: [1 1 1 1 | 1 1 1 1]  # Input: all ones
      
      Block (0):         Block (1):
      shared[0..3]       shared[0..3]
      [1 1 1 1]          [1 1 1 1]
      
  2. Thread coordination

    • Load phase with natural indexing:
      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    ↓         ↓        ↓         ↓   # Wait for all loads
      
    • Process phase: Each thread adds 10 to its shared tensor value
    • Result: output[global_i] = shared[local_i] + 10 = 11

    Note: In this specific case, the barrier() isn’t strictly necessary since each thread only writes to and reads from its own shared memory location (shared[local_i]). However, it’s included for educational purposes to demonstrate proper shared memory synchronization patterns that are essential when threads need to access each other’s data.

  3. LayoutTensor benefits

    • Shared memory allocation:
      # Clean tensor builder API
      shared = tb[dtype]().row_major[TPB]().shared().alloc()
      
    • Natural indexing for both global and shared:
      output[global_i] = shared[local_i] + 10
    • Per-block result:
      Block 0 output: [11 11 11 11]
      Block 1 output: [11 11 11 11]
      
    • Built-in layout management and type safety
  4. Memory access pattern

    • Load: Global tensor → Shared tensor (optimized)
    • Sync: Same barrier() requirement as raw version
    • Process: Add 10 to shared values
    • Store: Write 11s back to global tensor

This pattern shows how LayoutTensor maintains the performance benefits of shared memory while providing a more ergonomic API and built-in features.
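
Finally, the full file at problems/p08/p08_layout_tensor.mojo also contains host-side code that allocates device buffers, wraps them as LayoutTensors, launches the kernel, and checks the result. The sketch below shows roughly what such a driver looks like using the standard Mojo DeviceContext APIs and the aliases defined in the "Code to complete" section; the actual setup and verification in the puzzle file may differ in the details.

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

# Assumes the aliases from the puzzle (TPB, SIZE, BLOCKS_PER_GRID,
# THREADS_PER_BLOCK, dtype, layout) and the solution kernel above.


def main():
    with DeviceContext() as ctx:
        # Device buffers for output (zeros) and input (ones).
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(1)

        # Wrap the raw buffers with the 1D row-major layout.
        out_tensor = LayoutTensor[mut=True, dtype, layout](out_buf.unsafe_ptr())
        a_tensor = LayoutTensor[mut=True, dtype, layout](a_buf.unsafe_ptr())

        # Launch 2 blocks of TPB threads each.
        ctx.enqueue_function[add_10_shared_layout_tensor[layout]](
            out_tensor,
            a_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()

        # Copy the result back to the host and print it.
        with out_buf.map_to_host() as out_host:
            print("out:", out_host)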