Overview

Implement a kernel that adds 10 to each position of a 1D LayoutTensor a and stores the result in a 1D LayoutTensor output.

Note: You have fewer threads per block than the size of a.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor’s shared memory features
  • Thread synchronization with shared memory
  • Block-local data management with tensor builder

The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Key differences from raw approach

  1. Memory allocation: We will use the LayoutTensorBuild tensor builder instead of stack_allocation (an annotated reading of the builder chain follows this list)

    # Raw approach
    shared = stack_allocation[TPB, Scalar[dtype]]()
    
    # LayoutTensor approach
    shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
    
  2. Memory access: Same syntax

    # Raw approach
    shared[local_i] = a[global_i]
    
    # LayoutTensor approach
    shared[local_i] = a[global_i]
    
  3. Safety features:

    • Type safety
    • Layout management
    • Memory alignment handling

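For reference, here is the builder chain from difference 1 read one call at a time. The comments describe what each step contributes; the kernel code below uses the shorter name tb for the builder, and the chain reads the same way there.

# Same allocation as in the kernel body, split so each step is visible.
# This line only makes sense on the device, inside a GPU kernel.
shared = (
    tb[dtype]()        # builder parameterized on the element type
    .row_major[TPB]()  # 1D row-major layout of TPB elements
    .shared()          # place the allocation in block-shared memory
    .alloc()           # allocate and return the block-local tensor
)
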
Note: LayoutTensor handles memory layout, but you still need to manage thread synchronization with barrier() when using shared memory.

Educational Note: In this specific puzzle, the barrier() isn’t strictly necessary since each thread only accesses its own shared memory location. However, it’s included to teach proper shared memory synchronization patterns for more complex scenarios where threads need to coordinate access to shared data.
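
To see when the barrier genuinely matters, consider a variant in which each thread also reads a shared slot written by a different thread. The kernel below is only an illustrative sketch, not part of the puzzle; it assumes the imports commonly used in these puzzle files (gpu for thread indices and barrier, layout for LayoutTensor, and the tb tensor builder) and a hypothetical add_right_neighbor kernel name.

from gpu import barrier, block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 4
alias dtype = DType.float32


fn add_right_neighbor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Hypothetical kernel: each thread needs its right neighbor's value,
    # i.e. a shared slot written by a different thread.
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Required here: without it, thread i could read shared[local_i + 1]
    # before thread i + 1 has finished writing it.
    barrier()

    if global_i < size and local_i + 1 < TPB:
        output[global_i] = shared[local_i] + shared[local_i + 1]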

Code to complete

alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Synchronize with barrier() (educational - not strictly needed here)
  4. Process data using shared memory indices
  5. Guard against out-of-bounds access

Running the code

To test your solution, run the following command in your terminal:

uv run poe p08_layout_tensor

or, if you are using pixi:

pixi run p08_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Note: the barrier isn't strictly necessary here since each thread only
    # accesses its own shared memory location. It's kept to demonstrate the
    # synchronization pattern needed when threads coordinate access to shared data.
    barrier()

    if global_i < size:
        output[global_i] = shared[local_i] + 10


This solution demonstrates how LayoutTensor simplifies shared memory usage while maintaining performance:

  1. Memory hierarchy with LayoutTensor

    • Global tensors: a and output (slow, visible to all blocks)
    • Shared tensor: shared (fast, thread-block local)
    • Example for 8 elements with 4 threads per block:
      Global tensor a: [1 1 1 1 | 1 1 1 1]  # Input: all ones
      
      Block (0):         Block (1):
      shared[0..3]       shared[0..3]
      [1 1 1 1]          [1 1 1 1]
      
  2. Thread coordination

    • Load phase with natural indexing:
      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    ↓         ↓        ↓         ↓   # Wait for all loads
      
    • Process phase: Each thread adds 10 to its shared tensor value
    • Result: output[global_i] = shared[local_i] + 10 = 11

    Note: In this specific case, the barrier() isn’t strictly necessary since each thread only writes to and reads from its own shared memory location (shared[local_i]). However, it’s included for educational purposes to demonstrate proper shared memory synchronization patterns that are essential when threads need to access each other’s data.

  3. LayoutTensor benefits

    • Shared memory allocation:
      # Clean tensor builder API
      shared = tb[dtype]().row_major[TPB]().shared().alloc()
      
    • Natural indexing for both global and shared:
      output[global_i] = shared[local_i] + 10
    • Per-block result:
      Block 0 output: [11 11 11 11]
      Block 1 output: [11 11 11 11]
      
    • Built-in layout management and type safety
  4. Memory access pattern

    • Load: Global tensor → Shared tensor (optimized)
    • Sync: Same barrier() requirement as raw version
    • Process: Add 10 to shared values
    • Store: Write 11s back to global tensor

This pattern shows how LayoutTensor maintains the performance benefits of shared memory while providing a more ergonomic API and built-in features.
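
Finally, the full file at problems/p08/p08_layout_tensor.mojo also contains host-side code that allocates device buffers, wraps them as LayoutTensors, launches the kernel, and checks the result. The sketch below shows roughly what such a driver looks like using the standard Mojo DeviceContext APIs and the aliases defined in the "Code to complete" section; the actual setup and verification in the puzzle file may differ in the details.

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

# Assumes the aliases from the puzzle (TPB, SIZE, BLOCKS_PER_GRID,
# THREADS_PER_BLOCK, dtype, layout) and the solution kernel above.


def main():
    with DeviceContext() as ctx:
        # Device buffers for output (zeros) and input (ones).
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(1)

        # Wrap the raw buffers with the 1D row-major layout.
        out_tensor = LayoutTensor[mut=True, dtype, layout](out_buf.unsafe_ptr())
        a_tensor = LayoutTensor[mut=True, dtype, layout](a_buf.unsafe_ptr())

        # Launch 2 blocks of TPB threads each.
        ctx.enqueue_function[add_10_shared_layout_tensor[layout]](
            out_tensor,
            a_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()

        # Copy the result back to the host and print it.
        with out_buf.map_to_host() as out_host:
            print("out:", out_host)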