Overview

Implement a kernel that adds 10 to each position of a 1D LayoutTensor a and stores the result in a 1D LayoutTensor output.

Note: You have fewer threads per block than the size of a.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor’s shared memory features
  • Thread synchronization with shared memory
  • Block-local data management with tensor builder

The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block
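
As a rough illustration of this configuration, the following plain-Python sketch (a CPU model, not Mojo GPU code) enumerates which global element each thread maps to and which shared slot it would fill. The helper name index_map is hypothetical and exists only for this example.

```python
# Plain-Python model of the launch configuration; this is not Mojo GPU code.
SIZE = 8    # number of elements in a
TPB = 4     # threads per block
BLOCKS = 2  # number of blocks


def index_map(size, tpb, blocks):
    """Return (block_idx, local_i, global_i) for each thread that passes the bounds check."""
    rows = []
    for block_idx in range(blocks):
        for thread_idx in range(tpb):
            # Mirrors: global_i = block_dim.x * block_idx.x + thread_idx.x
            global_i = tpb * block_idx + thread_idx
            local_i = thread_idx
            if global_i < size:
                rows.append((block_idx, local_i, global_i))
    return rows


mapping = index_map(SIZE, TPB, BLOCKS)
for block_idx, local_i, global_i in mapping:
    print(f"block {block_idx}: shared[{local_i}] <- a[{global_i}]")
```

Each block only ever touches TPB shared slots, which is why TPB elements of shared memory per block are enough.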

Key differences from raw approach

  1. Memory allocation: We will use LayoutTensorBuild (imported as tb in the puzzle file) instead of stack_allocation

    # Raw approach
    shared = stack_allocation[
        TPB, Scalar[dtype], address_space = AddressSpace.SHARED
    ]()
    
    # LayoutTensor approach
    shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
    
  2. Memory access: Same syntax

    # Raw approach
    shared[local_i] = a[global_i]
    
    # LayoutTensor approach
    shared[local_i] = a[global_i]
    
  3. Safety features:

    • Type safety
    • Layout management
    • Memory alignment handling

Note: LayoutTensor handles memory layout, but you still need to manage thread synchronization with barrier() when using shared memory.
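
To make the synchronization requirement concrete, here is a small CPU analogy in plain Python that uses threading.Barrier in the role of barrier(). It is a sketch under stated assumptions, not the puzzle's kernel: the input values are made up, and each thread reads a neighbor's slot (which the puzzle itself does not do) so that the barrier visibly matters.

```python
import threading

TPB = 4
a = [1.0, 2.0, 3.0, 4.0]       # hypothetical input for a single block
shared = [0.0] * TPB           # stands in for the block's shared tensor
result = [0.0] * TPB
sync = threading.Barrier(TPB)  # plays the role of barrier()


def worker(local_i):
    # Load phase: each thread writes only its own slot.
    shared[local_i] = a[local_i]
    # No thread continues until every thread has finished its load.
    sync.wait()
    # Assumption for illustration: read a slot written by another thread.
    neighbor = (local_i + 1) % TPB
    result[local_i] = shared[neighbor] + 10


threads = [threading.Thread(target=worker, args=(i,)) for i in range(TPB)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result)
```

Without the sync.wait() call, a thread could read its neighbor's slot before that slot had been written. The same hazard applies to shared memory on the GPU, whichever allocation API is used.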

Code to complete

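from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 4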
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08_layout_tensor.mojo

Tips

  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Synchronize with barrier()
  4. Process data using shared memory indices
  5. Guard against out-of-bounds access

Running the code

To test your solution, run one of the following commands in your terminal, depending on whether you use uv or pixi:

uv run poe p08_layout_tensor
pixi run p08_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i < size:
        output[global_i] = shared[local_i] + 10


This solution demonstrates how LayoutTensor simplifies shared memory usage while maintaining performance:

  1. Memory hierarchy with LayoutTensor

    • Global tensors: a and output (slow, visible to all blocks)
    • Shared tensor: shared (fast, thread-block local)
    • Example for 8 elements with 4 threads per block:
      Global tensor a: [1 1 1 1 | 1 1 1 1]  # Input: all ones
      
      Block (0):         Block (1):
      shared[0..3]       shared[0..3]
      [1 1 1 1]          [1 1 1 1]
      
  2. Thread coordination

    • Load phase with natural indexing:
      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    ↓         ↓        ↓         ↓   # Wait for all loads
      
    • Process phase: Each thread adds 10 to its shared tensor value
    • Result: output[global_i] = shared[local_i] + 10 = 11

  3. LayoutTensor benefits

    • Shared memory allocation:
      # Clean tensor builder API
      shared = tb[dtype]().row_major[TPB]().shared().alloc()
      
    • Natural indexing for both global and shared tensors, producing:
      Block 0 output: [11 11 11 11]
      Block 1 output: [11 11 11 11]
      
    • Built-in layout management and type safety

  4. Memory access pattern

    • Load: Global tensor → Shared tensor (optimized)
    • Sync: Same barrier() requirement as raw version
    • Process: Add 10 to shared values
    • Store: Write 11s back to global tensor
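
The four steps above can be modeled on the CPU with a short plain-Python sketch. This is not the Mojo kernel: the function name add_10_shared_sim is hypothetical, blocks and threads run sequentially, and finishing the load loop before starting the process loop stands in for barrier().

```python
SIZE = 8
TPB = 4
BLOCKS = 2


def add_10_shared_sim(a, size, tpb=TPB, blocks=BLOCKS):
    """Sequential CPU model of load -> sync -> process -> store."""
    output = [0.0] * size
    for block_idx in range(blocks):
        shared = [0.0] * tpb  # each block gets its own shared buffer

        # Load: global tensor -> shared tensor
        for local_i in range(tpb):
            global_i = tpb * block_idx + local_i
            if global_i < size:
                shared[local_i] = a[global_i]

        # Sync: the load loop has fully finished here, which models barrier()

        # Process and store: add 10 and write back to the global tensor
        for local_i in range(tpb):
            global_i = tpb * block_idx + local_i
            if global_i < size:
                output[global_i] = shared[local_i] + 10
    return output


output = add_10_shared_sim([1.0] * SIZE, SIZE)
print(output)
```

Calling the function with a size that is not a multiple of TPB (for example 6) exercises the bounds check: threads whose global_i falls past the end do nothing.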

This pattern shows how LayoutTensor maintains the performance benefits of shared memory while providing a more ergonomic API and built-in features.