
Mojo🔥 GPU Puzzles

🚧 This book is a work in progress! Some sections may be incomplete or subject to change. 🚧

“For the things we have to learn before we can do them, we learn by doing them.” (Aristotle, Nicomachean Ethics)

Welcome to Mojo 🔥 GPU Puzzles, a hands-on guide to mastering GPU programming using Mojo 🔥 — the innovative programming language that combines Pythonic syntax with systems-level performance. GPU programming remains one of the most powerful skills in modern computing, driving advances in artificial intelligence, scientific simulation, and high-performance computing.

This book takes a unique approach to teaching GPU programming: learning by solving increasingly challenging puzzles. Rather than traditional textbook learning, you’ll immediately start writing real GPU code and seeing the results.

The first two chapters of this book are heavily inspired by GPU Puzzles, an interactive CUDA learning project. This adaptation reimplements these concepts using Mojo’s powerful abstractions and performance capabilities, while expanding on advanced topics with Mojo-specific optimizations.

Why Mojo 🔥 for GPU Programming?

The computing industry has reached a critical inflection point. We can no longer rely on new CPU generations to automatically increase application performance through higher clock speeds. As power and heat constraints have caused CPU clock speeds to plateau, hardware manufacturers have shifted toward increasing the number of physical cores. This multi-core revolution has reached its zenith in modern GPUs, which contain thousands of cores operating in parallel. The NVIDIA H100, for example, can run an astonishing 16,896 threads simultaneously in a single clock cycle, with over 270,000 threads queued and ready for execution.

Mojo represents a fresh approach to GPU programming, making this massive parallelism more accessible and productive:

  • Python-like Syntax with systems programming capabilities that feels familiar to the largest programming community
  • Zero-cost Abstractions that compile to efficient machine code without sacrificing performance
  • Strong Type System that catches errors at compile time while maintaining expressiveness
  • Built-in Tensor Support with hardware-aware optimizations specifically designed for GPU computation
  • Direct Access to low-level CPU and GPU intrinsics for systems-level programming
  • Cross-Hardware Portability allowing you to write code that can run on both CPUs and GPUs
  • Ergonomic and Safety Improvements over traditional C/C++ GPU programming
  • Lower Barrier to Entry enabling more programmers to harness GPU power effectively

Mojo 🔥 aims to fuel innovation by democratizing GPU programming. By expanding on Python’s familiar syntax while adding direct GPU access, Mojo empowers programmers with minimal specialized knowledge to build high-performance, heterogeneous (CPU and GPU) applications.

The GPU Programming Mindset

Effective GPU programming requires a fundamental shift in how we think about computation. Here are some key mental models that will guide your journey:

From Sequential to Parallel: Eliminating Loops with Threads

In traditional CPU programming, we process data sequentially through loops:

# CPU approach
for i in range(data_size):
    result[i] = process(data[i])

With GPUs, we flip this model entirely. Instead of moving sequentially through data, we map thousands of parallel threads directly onto the data:

# GPU approach (conceptual)
thread_id = get_global_id()
if thread_id < data_size:
    result[thread_id] = process(data[thread_id])

Each thread becomes responsible for computing a single element, eliminating the need for explicit loops. This mental shift—from “stepping through data” to “blanketing data with compute”—is central to GPU programming.

Fitting a Mesh of Compute Over Data

Imagine your data as a grid, and GPU threads as another grid that overlays it. Your task is to design this “compute mesh” to efficiently cover your data:

  • Threads: Individual compute units that process single data elements
  • Blocks: Organized groups of threads that share fast memory
  • Grid: The entire collection of blocks that covers your dataset

The art of GPU programming lies in crafting this mesh to maximize parallelism while respecting memory and synchronization constraints.
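
To make this concrete, here is a minimal sketch of sizing a one-dimensional grid so the blocks fully cover the data. The sizes are hypothetical, chosen only for illustration; the puzzles ahead use much smaller, fixed configurations:

# Hypothetical sizes, for illustration only
alias DATA_SIZE = 1_000_000
alias THREADS_PER_BLOCK = 256

# Ceiling division: enough blocks so that
# BLOCKS_PER_GRID * THREADS_PER_BLOCK >= DATA_SIZE
alias BLOCKS_PER_GRID = (DATA_SIZE + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK  # = 3907

The last block is only partially used; a guard inside the kernel (introduced in Puzzle 3) masks off the excess threads.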

Data Movement vs. Computation

In GPU programming, data movement is often more expensive than computation:

  • Moving data between CPU and GPU is slow
  • Moving data between global and shared memory is faster
  • Operating on data already in registers or shared memory is extremely fast

This inverts another common assumption in programming: computation is no longer the bottleneck—data movement is.

Through the puzzles in this book, you’ll develop an intuitive understanding of these principles, transforming how you approach computational problems.

What You Will Learn

This book takes you on a journey from first principles to advanced GPU programming techniques. Rather than treating the GPU as a mysterious black box, we’ll build your understanding layer by layer—starting with how individual threads operate and culminating in sophisticated parallel algorithms. By mastering both low-level memory management and high-level tensor abstractions, you’ll gain the versatility to tackle any GPU programming challenge.

Your Complete Learning Path

The book is structured into ten progressive parts, each building on the previous to create a comprehensive GPU programming education:

Essential Skill                         Covered In
Thread/Block basics                     Part I (Puzzles 1-8)
Core algorithms                         Part II (Puzzles 9-14)
MAX Graph integration                   Part III (Puzzles 15-17)
PyTorch integration                     Part IV (Puzzles 18-19)
Functional patterns & benchmarking      Part V (Puzzles 20-21)
Warp programming                        Part VI (Puzzles 22-23)
Memory optimization                     Part VII (Puzzles 24-27)
Performance analysis                    Part VIII (Puzzles 28-30)
Modern GPU features                     Part IX (Puzzles 31-33)
Scaling up                              Part X (Puzzles 34-36)

Detailed Learning Objectives

Part I: GPU Fundamentals

  • Master thread indexing and block organization
  • Understand memory access patterns and guards
  • Work with both raw pointers and LayoutTensor abstractions
  • Learn shared memory basics for inter-thread communication

Part II: GPU Algorithms

  • Implement parallel reductions and pooling operations
  • Build efficient convolution kernels
  • Master prefix sum (scan) algorithms
  • Optimize matrix multiplication with tiling strategies

Part III: MAX Graph Integration

  • Create custom MAX Graph operations
  • Interface GPU kernels with Python code
  • Build production-ready operations like softmax and attention

Part IV: PyTorch Integration

  • Bridge Mojo GPU kernels with PyTorch tensors
  • Use CustomOpLibrary for seamless tensor marshalling
  • Integrate with torch.compile for optimized execution

Part V: Mojo Functional Patterns & Benchmarking

  • Master essential functional patterns: elementwise, parallelize, vectorize, tile
  • Learn systematic performance optimization with tile_and_unswitch and unswitch
  • Develop quantitative benchmarking skills for informed decision-making

Part VI: Warp-Level Programming

  • Understand when to use warp programming vs functional patterns
  • Master essential warp operations: reduce_add, shuffle_down, vote_all
  • Learn to combine warp programming with functional patterns effectively

Part VII: Advanced Memory Operations

  • Achieve optimal memory coalescing patterns
  • Use async memory operations for overlapping compute
  • Implement memory fences and atomic operations
  • Master prefetching and cache optimization

Part VIII: Performance Analysis & Optimization

  • Profile GPU kernels and identify bottlenecks
  • Optimize occupancy and resource utilization
  • Eliminate shared memory bank conflicts

Part IX: Advanced GPU Features

  • Program tensor cores for AI workloads
  • Implement GPU-based random number generation
  • Master advanced synchronization patterns

Part X: Multi-GPU & Advanced Applications

  • Implement multi-stream concurrent execution
  • Scale across multiple GPUs
  • Build end-to-end optimized applications

This book deliberately departs from the usual approach: it first builds understanding through low-level memory manipulation, then gradually transitions to Mojo’s powerful LayoutTensor abstractions. This gives you both a deep understanding of GPU memory patterns and practical knowledge of modern tensor-based approaches.

🏆 Prizes and Rewards 🎉

Have you completed the available puzzles? We’re giving away free sticker packs to celebrate your achievement!

To claim your free stickers:

  1. Fork the GitHub repository https://github.com/modular/mojo-gpu-puzzles
  2. Add your solutions to the available puzzles
  3. Submit your solutions through this form and we’ll send you exclusive Modular stickers!

Note: More puzzles are being added regularly - complete the currently available ones to claim your reward!

How to Use This Book

Each puzzle follows a consistent format designed to progressively build your skills:

  • Overview: Clear problem statement and key concepts introduced in each puzzle
  • Configuration: Setup parameters and memory organization specific to each challenge
  • Code to Complete: Skeleton code with specific sections for you to implement
  • Tips: Optional hints if you get stuck, without giving away complete solutions
  • Solution: Detailed explanations of the implementation, performance considerations, and underlying concepts

The puzzles gradually increase in complexity, introducing new concepts while reinforcing fundamentals. We recommend solving them in order, as later puzzles build on skills developed in earlier ones.

Running the code

All puzzles are designed to be run with the provided testing framework that verifies your implementation against expected results. Each puzzle includes instructions for running the code and validating your solution.

Prerequisites

Compatible GPU

You’ll need a compatible GPU to run the puzzles.

Setting up your environment

  1. Clone the GitHub repository and navigate to the repository:

    # Clone the repository
    git clone https://github.com/modular/mojo-gpu-puzzles
    cd mojo-gpu-puzzles
    
  2. Install a package manager to run the Mojo🔥 programs. You can use either uv or pixi:

    Option A: uv

    Install:

    curl -fsSL https://astral.sh/uv/install.sh | sh

    Update:

    uv self update

    Create a virtual environment:

    uv venv && source .venv/bin/activate

    Option B: pixi

    Install:

    curl -fsSL https://pixi.sh/install.sh | sh

    Update:

    pixi self-update

  3. Run the puzzles via uv or pixi as follows:

    uv run poe pXX  # Replace XX with the puzzle number
    
    pixi run pXX  # Replace XX with the puzzle number
    

For example, to run puzzle 01:

  • uv run poe p01 or
  • pixi run p01

Knowledge prerequisites

Basic knowledge of:

  • Programming fundamentals (variables, loops, conditionals, functions)
  • Parallel computing concepts (threads, synchronization, race conditions)
  • Basic familiarity with Mojo (the language basics and intro to pointers sections)
  • A tour of GPU basics in Mojo is helpful

No prior GPU programming experience is necessary! We’ll build that knowledge through the puzzles.

Let’s begin our journey into the exciting world of GPU computing with Mojo 🔥!

Join the community

  • Subscribe for Updates
  • Modular Forum
  • Discord

Join our vibrant community to discuss GPU programming, share solutions, and get help!

Puzzle 1: Map

Overview

GPU programming is all about parallelism. In this puzzle, each thread will process a single element of the input array independently. Implement a kernel that adds 10 to each position of vector a and stores it in vector out.

Note: You have 1 thread per position.

Map

Key concepts

  • Basic GPU kernel structure
  • One-to-one thread to data mapping
  • Memory access patterns
  • Array operations on GPU

For each position \(i\): \[\Large out[i] = a[i] + 10\]

What we cover

🔰 Raw Memory Approach

Start with direct memory manipulation to understand GPU fundamentals.

💡 Preview: Modern Approach with LayoutTensor

See how LayoutTensor simplifies GPU programming with safer, cleaner code.

💡 Tip: Understanding both approaches helps you better appreciate modern GPU programming patterns.

Key concepts

In this puzzle, you’ll learn about:

  • Basic GPU kernel structure

  • Thread indexing with thread_idx.x

  • Simple parallel operations

  • Parallelism: Each thread executes independently

  • Thread indexing: Access element at position i = thread_idx.x

  • Memory access: Read from a[i] and write to out[i]

  • Data independence: Each output depends only on its corresponding input

Code to complete

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32


fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    i = thread_idx.x
    # FILL ME IN (roughly 1 line)


View full file: problems/p01/p01.mojo

Tips
  1. Store thread_idx.x in i
  2. Add 10 to a[i]
  3. Store result in out[i]

Running the code

To test your solution, run the following command in your terminal:

uv run poe p01
pixi run p01

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    i = thread_idx.x
    out[i] = a[i] + 10.0


This solution:

  • Gets thread index with i = thread_idx.x
  • Adds 10 to input value: out[i] = a[i] + 10.0

Why Consider LayoutTensor?

Looking at our traditional implementation above, you might notice some potential issues:

Current approach

i = thread_idx.x
out[i] = a[i] + 10.0

This works for 1D arrays, but what happens when we need to:

  • Handle 2D or 3D data?
  • Deal with different memory layouts?
  • Ensure coalesced memory access?

Preview of future challenges

As we progress through the puzzles, array indexing will become more complex:

# 2D indexing coming in later puzzles
idx = row * WIDTH + col

# 3D indexing
idx = (batch * HEIGHT + row) * WIDTH + col

# With padding
idx = (batch * padded_height + row) * padded_width + col

LayoutTensor preview

LayoutTensor will help us handle these cases more elegantly:

# Future preview - don't worry about this syntax yet!
out[i, j] = a[i, j] + 10.0  # 2D indexing
out[b, i, j] = a[b, i, j] + 10.0  # 3D indexing

We’ll learn about LayoutTensor in detail in Puzzle 4, where these concepts become essential. For now, focus on understanding:

  • Basic thread indexing
  • Simple memory access patterns
  • One-to-one mapping of threads to data

💡 Key Takeaway: While direct indexing works for simple cases, we’ll soon need more sophisticated tools for complex GPU programming patterns.

Puzzle 2: Zip

Overview

Implement a kernel that adds together each position of vector a and vector b and stores it in out.

Note: You have 1 thread per position.

Zip

Key concepts

In this puzzle, you’ll learn about:

  • Processing multiple input arrays in parallel
  • Element-wise operations with multiple inputs
  • Thread-to-data mapping across arrays
  • Memory access patterns with multiple arrays

For each thread \(i\): \[\Large out[i] = a[i] + b[i]\]

Memory access pattern

Thread 0:  a[0] + b[0] → out[0]
Thread 1:  a[1] + b[1] → out[1]
Thread 2:  a[2] + b[2] → out[2]
...

💡 Note: Notice how we’re now managing three arrays (a, b, out) in our kernel. As we progress to more complex operations, managing multiple array accesses will become increasingly challenging.

Code to complete

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32


fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    i = thread_idx.x
    # FILL ME IN (roughly 1 line)


View full file: problems/p02/p02.mojo

Tips
  1. Store thread_idx.x in i
  2. Add a[i] and b[i]
  3. Store result in out[i]

Running the code

To test your solution, run the following command in your terminal:

uv run poe p02
pixi run p02

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 2.0, 4.0, 6.0])

Solution

fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    i = thread_idx.x
    out[i] = a[i] + b[i]


This solution:

  • Gets thread index with i = thread_idx.x
  • Adds values from both arrays: out[i] = a[i] + b[i]

Looking ahead

While this direct indexing works for simple element-wise operations, consider:

  • What if arrays have different layouts?
  • What if we need to broadcast one array to another?
  • How to ensure coalesced access across multiple arrays?

These questions will be addressed when we introduce LayoutTensor in Puzzle 4.

Puzzle 3: Guards

Overview

Implement a kernel that adds 10 to each position of vector a and stores it in vector out.

Note: You have more threads than positions. This means you need to protect against out-of-bounds memory access.

Guard

Key concepts

In this puzzle, you’ll learn about:

  • Handling thread/data size mismatches
  • Preventing out-of-bounds memory access
  • Using conditional execution in GPU kernels
  • Safe memory access patterns

Mathematical Description

For each thread \(i\): \[\Large \text{if}\ i < \text{size}: out[i] = a[i] + 10\]

Memory Safety Pattern

Thread 0 (i=0):  if 0 < size:  out[0] = a[0] + 10  ✓ Valid
Thread 1 (i=1):  if 1 < size:  out[1] = a[1] + 10  ✓ Valid
Thread 2 (i=2):  if 2 < size:  out[2] = a[2] + 10  ✓ Valid
Thread 3 (i=3):  if 3 < size:  out[3] = a[3] + 10  ✓ Valid
Thread 4 (i=4):  if 4 < size:  ❌ Skip (out of bounds)
Thread 5 (i=5):  if 5 < size:  ❌ Skip (out of bounds)

💡 Note: Boundary checking becomes increasingly complex with:

  • Multi-dimensional arrays
  • Different array shapes
  • Complex access patterns

Code to complete

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (8, 1)
alias dtype = DType.float32


fn add_10_guard(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p03/p03.mojo

Tips
  1. Store thread_idx.x in i
  2. Add guard: if i < size
  3. Inside guard: out[i] = a[i] + 10.0

Running the code

To test your solution, run the following command in your terminal:

uv run poe p03
pixi run p03

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10_guard(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = thread_idx.x
    if i < size:
        out[i] = a[i] + 10.0


This solution:

  • Gets thread index with i = thread_idx.x
  • Guards against out-of-bounds access with if i < size
  • Inside guard: adds 10 to input value

Looking ahead

While simple boundary checks work here, consider these challenges:

  • What about 2D/3D array boundaries?
  • How to handle different shapes efficiently?
  • What if we need padding or edge handling?

Example of growing complexity:

# Current: 1D bounds check
if i < size: ...

# Coming soon: 2D bounds check
if i < height and j < width: ...

# Later: 3D with padding
if i < height and j < width and k < depth and
   i >= padding and j >= padding: ...

These boundary handling patterns will become more elegant when we learn about LayoutTensor in Puzzle 4, which provides built-in boundary checking and shape management.

Puzzle 4: 2D Map

Overview

Implement a kernel that adds 10 to each position of 2D square matrix a and stores it in 2D square matrix out.

Note: You have more threads than positions.

2D Matrix Mapping

Key concepts

  • 2D thread indexing
  • Matrix operations on GPU
  • Handling excess threads
  • Memory layout patterns

For each position \((i,j)\): \[\Large out[i,j] = a[i,j] + 10\]

Thread indexing convention

When working with 2D matrices in GPU programming, we follow a natural mapping between thread indices and matrix coordinates:

  • thread_idx.y corresponds to the row index
  • thread_idx.x corresponds to the column index

2D thread indexing

This convention aligns with:

  1. The standard mathematical notation where matrix positions are specified as (row, column)
  2. The visual representation of matrices where rows go top-to-bottom (y-axis) and columns go left-to-right (x-axis)
  3. Common GPU programming patterns where thread blocks are organized in a 2D grid matching the matrix structure

Historical origins

While graphics and image processing typically use \((x,y)\) coordinates, matrix operations in computing have historically used (row, column) indexing. This comes from how early computers stored and processed 2D data: line by line, top to bottom, with each line read left to right. This row-major memory layout proved efficient for both CPUs and GPUs, as it matches how they access memory sequentially. When GPU programming adopted thread blocks for parallel processing, it was natural to map thread_idx.y to rows and thread_idx.x to columns, maintaining consistency with established matrix indexing conventions.

Implementation approaches

🔰 Raw memory approach

Learn how 2D indexing works with manual memory management.

📚 Learn about LayoutTensor

Discover a powerful abstraction that simplifies multi-dimensional array operations and memory management on GPU.

🚀 Modern 2D operations

Put LayoutTensor into practice with natural 2D indexing and automatic bounds checking.

💡 Note: From this puzzle onward, we’ll primarily use LayoutTensor for cleaner, safer GPU code.

Overview

Implement a kernel that adds 10 to each position of 2D square matrix a and stores it in 2D square matrix out.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Working with 2D thread indices (thread_idx.x, thread_idx.y)
  • Converting 2D coordinates to 1D memory indices
  • Handling boundary checks in two dimensions

The key insight is understanding how to map from 2D thread coordinates \((i,j)\) to elements in a row-major matrix of size \(n \times n\), while ensuring thread indices are within bounds.

  • 2D indexing: Each thread has a unique \((i,j)\) position
  • Memory layout: Row-major ordering maps 2D to 1D memory (see the sketch after this list)
  • Guard condition: Need bounds checking in both dimensions
  • Thread bounds: More threads \((3 \times 3)\) than matrix elements \((2 \times 2)\)
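
The row-major mapping mentioned above can be written as a small helper (shown here purely for illustration; it is not part of the puzzle files):

fn row_major_index(row: Int, col: Int, size: Int) -> Int:
    # Row-major: rows are stored one after another, so moving down one row
    # skips `size` elements, while moving right one column skips one element.
    return row * size + col

For the \(2 \times 2\) matrix in this puzzle, the four elements land at offsets (0,0) → 0, (0,1) → 1, (1,0) → 2, (1,1) → 3.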

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn add_10_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p04/p04.mojo

Tips
  1. Get 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard add 10 in row-major way!

Running the code

To test your solution, run the following command in your terminal:

uv run poe p04
pixi run p04

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        out[row * size + col] = a[row * size + col] + 10.0


This solution:

  1. Get 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: out[row * size + col] = a[row * size + col] + 10.0

Introduction to LayoutTensor

Let’s take a quick break from solving puzzles to preview a powerful abstraction that will make our GPU programming journey more enjoyable: 🥁 … the LayoutTensor.

💡 This is a motivational overview of LayoutTensor’s capabilities. Don’t worry about understanding everything now - we’ll explore each feature in depth as we progress through the puzzles.

The challenge: Growing complexity

Let’s look at the challenges we’ve faced so far:

# Puzzle 1: Simple indexing
out[i] = a[i] + 10.0

# Puzzle 2: Multiple array management
out[i] = a[i] + b[i]

# Puzzle 3: Bounds checking
if i < size:
    out[i] = a[i] + 10.0

As dimensions grow, code becomes more complex:

# Traditional 2D indexing for row-major 2D matrix
idx = row * WIDTH + col
if row < height and col < width:
    out[idx] = a[idx] + 10.0

The solution: A peek at LayoutTensor

LayoutTensor will help us tackle these challenges with elegant solutions. Here’s a glimpse of what’s coming:

  1. Natural Indexing: Use tensor[i, j] instead of manual offset calculations
  2. Automatic Bounds Checking: Built-in protection against out-of-bounds access
  3. Flexible Memory Layouts: Support for row-major, column-major, and tiled organizations
  4. Performance Optimization: Efficient memory access patterns for GPU

A taste of what’s ahead

Let’s look at a few examples of what LayoutTensor can do. Don’t worry about understanding all the details now - we’ll cover each feature thoroughly in upcoming puzzles.

Basic usage example

from layout import Layout, LayoutTensor

# Define layout
alias HEIGHT = 2
alias WIDTH = 3
alias layout = Layout.row_major(HEIGHT, WIDTH)

# Create tensor
tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())

# Access elements naturally
tensor[0, 0] = 1.0  # First element
tensor[1, 2] = 2.0  # Last element

Preview of advanced features

As we progress through the puzzles, you’ll learn about:

  • Shared memory optimizations
  • Efficient tiling strategies
  • Vectorized operations
  • Hardware acceleration
  • Optimized memory access patterns

For example, different memory layouts can be expressed directly:

# Column-major layout
layout_col = Layout.col_major(HEIGHT, WIDTH)

# Tiled layout (for better cache utilization)
layout_tiled = tensor.tiled[4, 4](HEIGHT, WIDTH)

Each layout has its advantages:

  • Row-major: Elements in a row are contiguous

    # [1 2 3]
    # [4 5 6] -> [1 2 3 4 5 6]
    layout_row = Layout.row_major(2, 3)
    
  • Column-major: Elements in a column are contiguous

    # [1 2 3]
    # [4 5 6] -> [1 4 2 5 3 6]
    layout_col = Layout.col_major(2, 3)
    
  • Tiled: Elements grouped in tiles for cache efficiency

    # [[1 2] [3 4]] in 2x2 tiles
    layout_tiled = Layout.tiled[2, 2](4, 4)
    

Advanced GPU optimizations

As you progress, you’ll discover LayoutTensor’s powerful features for GPU programming:

  1. Memory hierarchy management
# Shared memory allocation
shared_mem = tb[dtype]().row_major[BM, BK]().shared().alloc()

# Register allocation
reg_tile = tb[dtype]().row_major[TM, TN]().local().alloc()
  2. Tiling strategies
# Block tiling
block_tile = tensor.tile[BM, BN](block_idx.y, block_idx.x)

# Register tiling
reg_tile = block_tile.tile[TM, TN](thread_row, thread_col)
  3. Memory access patterns
# Vectorized access
vec_tensor = tensor.vectorize[1, simd_width]()

# Asynchronous transfers
copy_dram_to_sram_async[thread_layout=layout](dst, src)
  4. Hardware acceleration
# Tensor Core operations (coming in later puzzles)
mma_op = TensorCore[dtype, out_type, Index(M, N, K)]()
result = mma_op.mma_op(a_reg, b_reg, c_reg)

💡 Looking ahead: Through these puzzles, you’ll learn to:

  • Optimize data access with shared memory
  • Implement efficient tiling strategies
  • Leverage vectorized operations
  • Utilize hardware accelerators
  • Master memory access patterns

Each concept builds on the last, gradually taking you from basic tensor operations to advanced GPU programming. Ready to begin? Let’s start with the fundamentals!

Quick example

Let’s put everything together with a simple example that demonstrates the basics of LayoutTensor:

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
alias layout = Layout.row_major(HEIGHT, WIDTH)

fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)

def main():
    ctx = DeviceContext()

    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)

    ctx.synchronize()

When we run this code with:

uv run poe layout_tensor_intro
pixi run layout_tensor_intro

we get the following output:

Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0

Let’s break down what’s happening:

  1. We create a 2 x 3 tensor with row-major layout
  2. Initially, all elements are zero
  3. Using natural indexing, we modify a single element
  4. The change is reflected in our output

This simple example demonstrates key LayoutTensor benefits:

  • Clean syntax for tensor creation and access
  • Automatic memory layout handling
  • Built-in bounds checking
  • Natural multi-dimensional indexing

While this example is straightforward, the same patterns will scale to complex GPU operations in upcoming puzzles. You’ll see how these basic concepts extend to:

  • Multi-threaded GPU operations
  • Shared memory optimizations
  • Complex tiling strategies
  • Hardware-accelerated computations

Ready to start your GPU programming journey with LayoutTensor? Let’s dive into the puzzles!

💡 Tip: Keep this example in mind as we progress - we’ll build upon these fundamental concepts to create increasingly sophisticated GPU programs.

LayoutTensor Version

Overview

Implement a kernel that adds 10 to each position of 2D LayoutTensor a and stores it in 2D LayoutTensor out.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for 2D array access
  • Direct 2D indexing with tensor[i, j]
  • Handling bounds checking with LayoutTensor

The key insight is that LayoutTensor provides a natural 2D indexing interface, abstracting away the underlying memory layout while still requiring bounds checking.

  • 2D access: Natural \((i,j)\) indexing with LayoutTensor
  • Memory abstraction: No manual row-major calculation needed
  • Guard condition: Still need bounds checking in both dimensions
  • Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn add_10_2d(
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p04/p04_layout_tensor.mojo

Tips
  1. Get 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard add 10 to a[row, col]

Running the code

To test your solution, run the following command in your terminal:

uv run poe p04_layout_tensor
pixi run p04_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10_2d(
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if col < size and row < size:
        out[row, col] = a[row, col] + 10.0


This solution:

  • Gets 2D thread indices with row = thread_idx.y, col = thread_idx.x
  • Guards against out-of-bounds with if row < size and col < size
  • Uses LayoutTensor’s 2D indexing: out[row, col] = a[row, col] + 10.0

Puzzle 5: Broadcast

Overview

Implement a kernel that broadcast adds vector a and vector b and stores it in 2D matrix out.

Note: You have more threads than positions.

Broadcast visualization

Key concepts

  • Broadcasting vectors to matrix
  • 2D thread management
  • Mixed dimension operations
  • Memory layout patterns

Implementation approaches

🔰 Raw memory approach

Learn how to handle broadcasting with manual memory indexing.

📐 LayoutTensor Version

Use LayoutTensor to handle mixed-dimension operations.

💡 Note: Notice how LayoutTensor simplifies broadcasting compared to manual indexing.

Overview

Implement a kernel that broadcast adds vector a and vector b and stores it in 2D matrix out.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Broadcasting 1D vectors across different dimensions
  • Using 2D thread indices for broadcast operations
  • Handling boundary conditions in broadcast patterns

The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly.

  • Broadcasting: Each element of a combines with each element of b (worked out below)
  • Thread mapping: 2D thread grid \((3 \times 3)\) for \(2 \times 2\) output
  • Vector access: Different access patterns for a and b
  • Bounds checking: Guard against threads outside matrix dimensions
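
To see what the broadcast produces, assume the test fills the inputs as \(a = [0, 1]\) and \(b = [0, 1]\) (an assumption chosen so the arithmetic matches the expected output shown further below). Then

\[\Large \begin{bmatrix} a_0 + b_0 & a_1 + b_0 \\ a_0 + b_1 & a_1 + b_1 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 2 \end{bmatrix}\]

which, flattened in row-major order, is exactly the expected [0.0, 1.0, 1.0, 2.0].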

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn broadcast_add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p05/p05.mojo

Tips
  1. Get 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to broadcast values of a and b

Running the code

To test your solution, run the following command in your terminal:

uv run poe p05
pixi run p05

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])

Solution

fn broadcast_add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        out[row * size + col] = a[col] + b[row]


This solution demonstrates fundamental GPU broadcasting concepts without LayoutTensor abstraction:

  1. Thread to matrix mapping

    • Uses thread_idx.y for row access and thread_idx.x for column access
    • Direct mapping from 2D thread grid to output matrix elements
    • Handles excess threads (3×3 grid) for 2×2 output matrix
  2. Broadcasting mechanics

    • Vector a broadcasts horizontally: same a[col] used across each row
    • Vector b broadcasts vertically: same b[row] used across each column
    • Output combines both vectors through addition
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. Bounds checking

    • Single guard condition row < size and col < size handles both dimensions
    • Prevents out-of-bounds access for both input vectors and output matrix
    • Required due to 3×3 thread grid being larger than 2×2 data

Compare this with the LayoutTensor version to see how the abstraction simplifies broadcasting operations while maintaining the same underlying concepts.

LayoutTensor Version

Overview

Implement a kernel that broadcast adds 1D LayoutTensor a and 1D LayoutTensor b and stores it in 2D LayoutTensor out.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for broadcast operations
  • Working with different tensor shapes
  • Handling 2D indexing with LayoutTensor

The key insight is that LayoutTensor allows natural broadcasting through different tensor shapes: \((1, n)\) and \((n, 1)\) to \((n,n)\), while still requiring bounds checking.

  • Tensor shapes: Input vectors have shapes \((1, n)\) and \((n, 1)\)
  • Broadcasting: Output combines both dimensions to \((n,n)\)
  • Guard condition: Still need bounds checking for output size
  • Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(1, SIZE)
alias b_layout = Layout.row_major(SIZE, 1)


fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    b: LayoutTensor[mut=False, dtype, b_layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p05/p05_layout_tensor.mojo

Tips
  1. Get 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to broadcast values of a and b as LayoutTensors

Running the code

To test your solution, run the following command in your terminal:

uv run poe p05_layout_tensor
pixi run p05_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])

Solution

fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    b: LayoutTensor[mut=False, dtype, b_layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        out[row, col] = a[0, col] + b[row, 0]


This solution demonstrates key concepts of LayoutTensor broadcasting and GPU thread mapping:

  1. Thread to matrix mapping

    • Uses thread_idx.y for row access and thread_idx.x for column access
    • Natural 2D indexing matches the output matrix structure
    • Excess threads (3×3 grid) are handled by bounds checking
  2. Broadcasting mechanics

    • Input a has shape (1,n): a[0,col] broadcasts across rows
    • Input b has shape (n,1): b[row,0] broadcasts across columns
    • Output has shape (n,n): Each element is sum of corresponding broadcasts
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. Bounds Checking

    • Guard condition row < size and col < size prevents out-of-bounds access
    • Handles both matrix bounds and excess threads efficiently
    • No need for separate checks for a and b due to broadcasting

This pattern forms the foundation for more complex tensor operations we’ll explore in later puzzles.

Puzzle 6: Blocks

Overview

Implement a kernel that adds 10 to each position of vector a and stores it in out.

Note: You have fewer threads per block than the size of a.

Blocks visualization

Key concepts

In this puzzle, you’ll learn about:

  • Processing data larger than thread block size
  • Coordinating multiple blocks of threads
  • Computing global thread positions

The key insight is understanding how blocks of threads work together to process data that’s larger than a single block’s capacity, while maintaining correct element-to-thread mapping.
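
As a quick worked instance of the global-index formula with this puzzle’s configuration (4 threads per block, 3 blocks, 9 elements):

i = block_dim.x * block_idx.x + thread_idx.x

Thread 1 of block 2 therefore gets \(i = 4 \times 2 + 1 = 9\), which falls outside the 9-element array; this is exactly why the kernel needs a guard.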

Code to complete

alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32


fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p06/p06.mojo

Tips
  1. Calculate global index: i = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if i < size
  3. Inside guard: out[i] = a[i] + 10.0

Running the code

To test your solution, run the following command in your terminal:

uv run poe p06
pixi run p06

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])

Solution

fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        out[i] = a[i] + 10.0


This solution demonstrates key concepts of block-based GPU processing:

  1. Global thread indexing

    • Combines block and thread indices: block_dim.x * block_idx.x + thread_idx.x
    • Maps each thread to a unique global position
    • Example for 4 threads per block (matching THREADS_PER_BLOCK above):
      Block 0: [0 1 2 3]
      Block 1: [4 5 6 7]
      Block 2: [8 9 10 11]  (indices 9-11 are out of bounds and skipped by the guard)
      
  2. Block coordination

    • Each block processes a contiguous chunk of data
    • Block size (4) < Data size (9) requires multiple blocks
    • Automatic work distribution across blocks:
      Data:    [0 1 2 3 4 5 6 7 8]
      Block 0: [0 1 2 3]
      Block 1:         [4 5 6 7]
      Block 2:                 [8]
      
  3. Bounds checking

    • Guard condition i < size handles edge cases
    • Prevents out-of-bounds access when size isn’t perfectly divisible by block size
    • Essential for handling partial blocks at the end of data
  4. Memory access pattern

    • Coalesced memory access: threads in a block access contiguous memory
    • Each thread processes one element: out[i] = a[i] + 10.0
    • Block-level parallelism enables efficient memory bandwidth utilization

This pattern forms the foundation for processing large datasets that exceed the size of a single thread block.

Puzzle 7: 2D Blocks

Overview

Implement a kernel that adds 10 to each position of matrix a and stores it in out.

Note: You have fewer threads per block than the size of a in both directions.

Blocks 2D visualization

Key concepts

  • Block-based processing
  • Grid-block coordination
  • Multi-block indexing
  • Memory access patterns

🔑 2D thread indexing convention

We extend the block-based indexing from puzzle 04 to 2D:

Global position calculation:
row = block_dim.y * block_idx.y + thread_idx.y
col = block_dim.x * block_idx.x + thread_idx.x

For example, with 2×2 thread blocks arranged in a 2×2 grid to cover a 4×4 matrix:

Block (0,0):   Block (1,0):
[0,0  0,1]     [0,2  0,3]
[1,0  1,1]     [1,2  1,3]

Block (0,1):   Block (1,1):
[2,0  2,1]     [2,2  2,3]
[3,0  3,1]     [3,2  3,3]

Each position shows (row, col) for that thread’s global index. The block dimensions and indices work together to ensure:

  • Continuous coverage of the 2D space
  • No overlap between blocks
  • Efficient memory access patterns
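
Here is a worked instance of the formulas above for the thread at thread_idx = (x=0, y=1) inside block_idx = (x=0, y=1), with block_dim = (2, 2):

row = block_dim.y * block_idx.y + thread_idx.y = 2 * 1 + 1 = 3
col = block_dim.x * block_idx.x + thread_idx.x = 2 * 0 + 0 = 0

so this thread covers position (3, 0), which matches the entry shown for Block (0,1) in the diagram.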

Implementation approaches

🔰 Raw memory approach

Learn how to handle multi-block operations with manual indexing.

📐 LayoutTensor Version

Use LayoutTensor features to elegantly handle block-based processing.

💡 Note: See how LayoutTensor simplifies block coordination and memory access patterns.

Overview

Implement a kernel that adds 10 to each position of matrix a and stores it in out.

Note: You have fewer threads per block than the size of a in both directions.

Key concepts

In this puzzle, you’ll learn about:

  • Working with 2D block and thread arrangements
  • Handling matrix data larger than block size
  • Converting between 2D and linear memory access

The key insight is understanding how to coordinate multiple blocks of threads to process a 2D matrix that’s larger than a single block’s dimensions.

Configuration

  • Matrix size: \(5 \times 5\) elements
  • 2D blocks: Each block processes a \(3 \times 3\) region
  • Grid layout: Blocks arranged in \(2 \times 2\) grid
  • Total threads: \(36\) for \(25\) elements (see the quick check below)
  • Memory pattern: Row-major storage for 2D data
  • Coverage: Ensuring all matrix elements are processed
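
A quick check of the thread count above: each dimension needs \(\lceil 5/3 \rceil = 2\) blocks, so the grid has \(2 \times 2 = 4\) blocks of \(3 \times 3 = 9\) threads each, i.e. \(4 \times 9 = 36\) threads, of which \(25\) map to matrix elements and the remaining \(11\) are masked out by the bounds check.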

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07.mojo

Tips
  1. Calculate global indices: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to add 10 in row-major way!

Running the code

To test your solution, run the following command in your terminal:

uv run poe p07
pixi run p07

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        out[row * size + col] = a[row * size + col] + 10.0


This solution demonstrates key concepts of 2D block-based processing with raw memory:

  1. 2D thread indexing

    • Global row: block_dim.y * block_idx.y + thread_idx.y
    • Global col: block_dim.x * block_idx.x + thread_idx.x
    • Maps thread grid to matrix elements:
      5×5 matrix with 3×3 blocks:
      
      Block (0,0)         Block (1,0)
      [(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
      [(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
      [(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]
      
      Block (0,1)         Block (1,1)
      [(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
      [(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
      [  *     *     *  ] [  *     *      *  ]
      
      (* = thread exists but outside matrix bounds)
  2. Memory layout

    • Row-major linear memory: index = row * size + col
    • Example for 5×5 matrix:
      2D indices:    Linear memory:
      (2,1) -> 11   [00 01 02 03 04]
                    [05 06 07 08 09]
                    [10 11 12 13 14]
                    [15 16 17 18 19]
                    [20 21 22 23 24]
      
  3. Bounds checking

    • Guard row < size and col < size handles:
      • Excess threads in partial blocks
      • Edge cases at matrix boundaries
      • 2×2 block grid with 3×3 threads each = 36 threads for 25 elements
  4. Block coordination

    • Each 3×3 block processes part of 5×5 matrix
    • 2×2 grid of blocks ensures full coverage
    • Overlapping threads handled by bounds check
    • Efficient parallel processing across blocks

This pattern shows how to handle 2D data larger than block size while maintaining efficient memory access and thread coordination.

LayoutTensor Version

Overview

Implement a kernel that adds 10 to each position of 2D LayoutTensor a and stores it in 2D LayoutTensor out.

Note: You have fewer threads per block than the size of a in both directions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor with multiple blocks
  • Handling large matrices with 2D block organization
  • Combining block indexing with LayoutTensor access

The key insight is that LayoutTensor simplifies 2D indexing while still requiring proper block coordination for large matrices.

Configuration

  • Matrix size: \(5 \times 5\) elements
  • Layout handling: LayoutTensor manages row-major organization
  • Block coordination: Multiple blocks cover the full matrix
  • 2D indexing: Natural \((i,j)\) access with bounds checking
  • Total threads: \(36\) for \(25\) elements
  • Thread mapping: Each thread processes one matrix element

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, SIZE)


fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07_layout_tensor.mojo

Tips
  1. Calculate global indices: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if row < size and col < size
  3. Inside guard: think about how to add 10 to 2D LayoutTensor

Running the code

To test your solution, run the following command in your terminal:

uv run poe p07_layout_tensor
pixi run p07_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, a_layout],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        out[row, col] = a[row, col] + 10.0


This solution demonstrates how LayoutTensor simplifies 2D block-based processing:

  1. 2D thread indexing

    • Global row: block_dim.y * block_idx.y + thread_idx.y
    • Global col: block_dim.x * block_idx.x + thread_idx.x
    • Maps thread grid to tensor elements:
      5×5 tensor with 3×3 blocks:
      
      Block (0,0)         Block (1,0)
      [(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
      [(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
      [(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]
      
      Block (0,1)         Block (1,1)
      [(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
      [(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
      [  *     *     *  ] [  *     *      *  ]
      
      (* = thread exists but outside tensor bounds)
  2. LayoutTensor benefits

    • Natural 2D indexing: tensor[row, col] instead of manual offset calculation
    • Automatic memory layout optimization
    • Example access pattern:
      Raw memory:         LayoutTensor:
      row * size + col    tensor[row, col]
      (2,1) -> 11        (2,1) -> same element
      
  3. Bounds checking

    • Guard row < size and col < size handles:
      • Excess threads in partial blocks
      • Edge cases at tensor boundaries
      • Automatic memory layout handling by LayoutTensor
      • 36 threads (2×2 blocks of 3×3) for 25 elements
  4. Block coordination

    • Each 3×3 block processes part of 5×5 tensor
    • LayoutTensor handles:
      • Memory layout optimization
      • Efficient access patterns
      • Block boundary coordination
      • Cache-friendly data access

This pattern shows how LayoutTensor simplifies 2D block processing while maintaining optimal memory access patterns and thread coordination.

Puzzle 8: Shared Memory

Overview

Implement a kernel that adds 10 to each position of a vector a and stores it in vector out.

Note: You have fewer threads per block than the size of a.

Shared memory visualization

Implementation approaches

🔰 Raw memory approach

Learn how to manually manage shared memory and synchronization.

📐 LayoutTensor Version

Use LayoutTensor’s built-in shared memory management features.

💡 Note: Experience how LayoutTensor simplifies shared memory operations while maintaining performance.

Overview

Implement a kernel that adds 10 to each position of a vector a and stores it in out.

Note: You have fewer threads per block than the size of a.

Key concepts

In this puzzle, you’ll learn about:

  • Using shared memory within thread blocks
  • Synchronizing threads with barriers
  • Managing block-local data storage

The key insight is understanding how shared memory provides fast, block-local storage that all threads in a block can access, requiring careful coordination between threads.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Notes:

  • Shared memory: Fast storage shared by threads in a block
  • Thread sync: Coordination using barrier()
  • Memory scope: Shared memory only visible within block
  • Access pattern: Local vs global indexing

Warning: Each block has only a fixed amount of shared memory that its threads can read and write. The size must be a compile-time constant (such as the alias TPB), not a runtime variable. After writing to shared memory, call barrier() so that all threads finish writing before any thread reads the data.
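
As a worked instance of local vs. global indexing in this configuration: thread 2 of block 1 has global_i = block_dim.x * block_idx.x + thread_idx.x = 4 * 1 + 2 = 6 and local_i = thread_idx.x = 2, so it copies a[6] into shared[2], and after the barrier it uses shared[2] to produce the value for out[6].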

Code to complete

alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32


fn add_10_shared(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # wait for all threads to complete
    # works within a thread block
    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08.mojo

Tips
  1. Wait for shared memory load with barrier()
  2. Use local_i to access shared memory: shared[local_i]
  3. Use global_i for output: out[global_i]
  4. Add guard: if global_i < size

Running the code

To test your solution, run the following command in your terminal:

uv run poe p08
pixi run p08

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # wait for all threads to complete
    # works within a thread block
    barrier()

    # process using shared memory
    if global_i < size:
        out[global_i] = shared[local_i] + 10


This solution demonstrates key concepts of shared memory usage in GPU programming:

  1. Memory hierarchy

    • Global memory: a and out arrays (slow, visible to all blocks)
    • Shared memory: shared array (fast, thread-block local)
    • Example for 8 elements with 4 threads per block:
      Global array a: [1 1 1 1 | 1 1 1 1]  # Input: all ones
      
      Block (0):      Block (1):
      shared[0..3]    shared[0..3]
      [1 1 1 1]       [1 1 1 1]
      
  2. Thread coordination

    • Load phase:
      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    ↓         ↓        ↓         ↓   # Wait for all loads
      
    • Process phase: Each thread adds 10 to its shared memory value
    • Result: out[i] = shared[local_i] + 10 = 11
  3. Index mapping

    • Global index: block_dim.x * block_idx.x + thread_idx.x
      Block 0 output: [11 11 11 11]
      Block 1 output: [11 11 11 11]
      
    • Local index: thread_idx.x for shared memory access
      Both blocks process: 1 + 10 = 11
      
  4. Memory access pattern

    • Load: Global → Shared (coalesced reads of 1s)
    • Sync: barrier() ensures all loads complete
    • Process: Add 10 to shared values
    • Store: Write 11s back to global memory

This pattern shows how to use shared memory to optimize data access while maintaining thread coordination within blocks.

Overview

Implement a kernel that adds 10 to each position of a 1D LayoutTensor a and stores it in 1D LayoutTensor out.

Note: You have fewer threads per block than the size of a.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor’s shared memory features
  • Thread synchronization with shared memory
  • Block-local data management with tensor builder

The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Key differences from raw approach

  1. Memory allocation: We will use LayoutTensorBuild instead of stack_allocation

    # Raw approach
    shared = stack_allocation[TPB, Scalar[dtype]]()
    
    # LayoutTensor approach
    shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
    
  2. Memory access: Same syntax

    # Raw approach
    shared[local_i] = a[global_i]
    
    # LayoutTensor approach
    shared[local_i] = a[global_i]
    
  3. Safety features:

    • Type safety
    • Layout management
    • Memory alignment handling

Note: LayoutTensor handles memory layout, but you still need to manage thread synchronization with barrier() when using shared memory.

Code to complete

alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Synchronize with barrier()
  4. Process data using shared memory indices
  5. Guard against out-of-bounds access

Running the code

To test your solution, run the following command in your terminal:

uv run poe p08_layout_tensor
pixi run p08_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i < size:
        out[global_i] = shared[local_i] + 10


This solution demonstrates how LayoutTensor simplifies shared memory usage while maintaining performance:

  1. Memory hierarchy with LayoutTensor

    • Global tensors: a and out (slow, visible to all blocks)
    • Shared tensor: shared (fast, thread-block local)
    • Example for 8 elements with 4 threads per block:
      Global tensor a: [1 1 1 1 | 1 1 1 1]  # Input: all ones
      
      Block (0):         Block (1):
      shared[0..3]       shared[0..3]
      [1 1 1 1]          [1 1 1 1]
      
  2. Thread coordination

    • Load phase with natural indexing:
      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    ↓         ↓        ↓         ↓   # Wait for all loads
      
    • Process phase: Each thread adds 10 to its shared tensor value
    • Result: out[global_i] = shared[local_i] + 10 = 11
  3. LayoutTensor benefits

    • Shared memory allocation:
      # Clean tensor builder API
      shared = tb[dtype]().row_major[TPB]().shared().alloc()
      
    • Natural indexing for both global and shared:
      Block 0 output: [11 11 11 11]
      Block 1 output: [11 11 11 11]
      
    • Built-in layout management and type safety
  4. Memory access pattern

    • Load: Global tensor → Shared tensor (optimized)
    • Sync: Same barrier() requirement as raw version
    • Process: Add 10 to shared values
    • Store: Write 11s back to global tensor

This pattern shows how LayoutTensor maintains the performance benefits of shared memory while providing a more ergonomic API and built-in features.

Puzzle 9: Pooling

Overview

Implement a kernel that computes the running sum of the last 3 positions of vector a and stores it in vector out.

Note: You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Pooling visualization

Implementation approaches

🔰 Raw memory approach

Learn how to implement sliding window operations with manual memory management and synchronization.

📐 LayoutTensor Version

Use LayoutTensor’s features for efficient window-based operations and shared memory management.

💡 Note: See how LayoutTensor simplifies sliding window operations while maintaining efficient memory access patterns.

Overview

Implement a kernel that computes the running sum of the last 3 positions of vector a and stores it in vector out.

Note: You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Key concepts

In this puzzle, you’ll learn about:

  • Using shared memory for sliding window operations
  • Handling boundary conditions in pooling
  • Coordinating thread access to neighboring elements

The key insight is understanding how to efficiently access a window of elements using shared memory, with special handling for the first elements in the sequence.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Window size: 3 elements
  • Shared memory: TPB elements

Notes:

  • Window access: Each output depends on up to 3 previous elements
  • Edge handling: First two positions need special treatment
  • Memory pattern: One shared memory load per thread
  • Thread sync: Coordination before window operations

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32


fn pooling(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


View full file: problems/p09/p09.mojo

Tips
  1. Load data and call barrier()
  2. Special cases: out[0] = shared[0], out[1] = shared[0] + shared[1]
  3. General case: if 1 < global_i < size
  4. Sum three elements: shared[local_i - 2] + shared[local_i - 1] + shared[local_i]

Running the code

To test your solution, run the following command in your terminal:

uv run poe p09
pixi run p09

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

Solution

fn pooling(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i == 0:
        out[0] = shared[0]
    elif global_i == 1:
        out[1] = shared[0] + shared[1]
    elif 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


The solution implements a sliding window sum using shared memory with these key steps:

  1. Shared memory setup

    • Allocates TPB elements in shared memory:
      Input array:  [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      
    • Each thread loads one element from global memory
    • barrier() ensures all data is loaded
  2. Boundary cases

    • Position 0: Single element
      out[0] = shared[0] = 0.0
      
    • Position 1: Sum of first two elements
      out[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0
      
  3. Main window operation

    • For positions 2 and beyond:
      Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0
      Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0
      Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0
      ...
      
    • Window calculation using local indices:
      # Sliding window of 3 elements
      window_sum = shared[i-2] + shared[i-1] + shared[i]
      
  4. Memory access pattern

    • One global read per thread into shared memory
    • One global write per thread from shared memory
    • Uses shared memory for efficient neighbor access
    • Maintains coalesced memory access pattern

This approach optimizes performance through:

  • Minimal global memory access
  • Fast shared memory neighbor lookups
  • Clean boundary handling
  • Efficient memory coalescing

The final output shows the cumulative window sums:

[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
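
As a quick sanity check, the expected values can be reproduced with a few lines of host-side Python (a reference sketch, not part of the puzzle files):

# Python reference: inclusive running sum over the last 3 positions.
SIZE = 8
a = [float(i) for i in range(SIZE)]        # [0.0, 1.0, ..., 7.0]

out = []
for i in range(SIZE):
    window = a[max(0, i - 2): i + 1]       # at most 3 elements ending at position i
    out.append(sum(window))

print(out)  # [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]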

Overview

Implement a kernel that computes the running sum of the last 3 positions of 1D LayoutTensor a and stores it in 1D LayoutTensor out.

Note: You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for sliding window operations
  • Managing shared memory with LayoutTensorBuild, which we saw in puzzle_08
  • Efficient neighbor access patterns
  • Boundary condition handling

The key insight is how LayoutTensor simplifies shared memory management while maintaining efficient window-based operations.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Window size: 3 elements
  • Shared memory: TPB elements

Notes:

  • Tensor builder: Use LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
  • Window access: Natural indexing for 3-element windows
  • Edge handling: Special cases for first two positions
  • Memory pattern: One shared memory load per thread

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


View full file: problems/p09/p09_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Handle special cases for first two elements
  4. Use shared memory for window operations
  5. Guard against out-of-bounds access

Running the code

To test your solution, run the following command in your terminal:

uv run poe p09_layout_tensor
pixi run p09_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

Solution

fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Synchronize threads within block
    barrier()

    # Handle first two special cases
    if global_i == 0:
        out[0] = shared[0]
    elif global_i == 1:
        out[1] = shared[0] + shared[1]
    # Handle general case
    elif 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


The solution implements a sliding window sum using LayoutTensor with these key steps:

  1. Shared memory setup

    • Tensor builder creates block-local storage:
      shared = tb[dtype]().row_major[TPB]().shared().alloc()
      
    • Each thread loads one element:
      Input array:  [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      
    • barrier() ensures all data is loaded
  2. Boundary cases

    • Position 0: Single element
      out[0] = shared[0] = 0.0
      
    • Position 1: Sum of first two elements
      out[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0
      
  3. Main window operation

    • For positions 2 and beyond:
      Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0
      Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0
      Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0
      ...
      
    • Natural indexing with LayoutTensor:
      # Sliding window of 3 elements
      window_sum = shared[i-2] + shared[i-1] + shared[i]
      
  4. Memory access pattern

    • One global read per thread into shared tensor
    • Efficient neighbor access through shared memory
    • LayoutTensor benefits:
      • Automatic bounds checking
      • Natural window indexing
      • Layout-aware memory access
      • Type safety throughout

This approach combines the performance of shared memory with LayoutTensor’s safety and ergonomics:

  • Minimizes global memory access
  • Simplifies window operations
  • Handles boundaries cleanly
  • Maintains coalesced access patterns

The final output shows the cumulative window sums:

[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]

Puzzle 10: Dot Product

Overview

Implement a kernel that computes the dot-product of vector a and vector b and stores it in out (single number).

Note: You have 1 thread per position. You only need 2 global reads per thread and 1 global write per thread block.

Dot product visualization

Implementation approaches

🔰 Raw memory approach

Learn how to implement the reduction with manual memory management and synchronization.

📐 LayoutTensor Version

Use LayoutTensor’s features for efficient reduction and shared memory management.

💡 Note: See how LayoutTensor simplifies efficient memory access patterns.

Overview

Implement a kernel that computes the dot-product of vector a and vector b and stores it in out (single number).

Note: You have 1 thread per position. You only need 2 global reads per thread and 1 global write per thread block.

Key concepts

In this puzzle, you’ll learn about:

  • Implementing parallel reduction operations
  • Using shared memory for intermediate results
  • Coordinating threads for collective operations

The key insight is understanding how to efficiently combine multiple values into a single result using parallel computation and shared memory.

Configuration

  • Vector size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Output size: 1 element
  • Shared memory: TPB elements

Notes:

  • Element access: Each thread reads corresponding elements from a and b
  • Partial results: Computing and storing intermediate values
  • Thread coordination: Synchronizing before combining results
  • Final reduction: Converting partial results to scalar output

Note: For this problem, you don’t need to worry about the number of shared reads. We will handle that challenge later.

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (SIZE, 1)
alias dtype = DType.float32


fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...


View full file: problems/p10/p10.mojo

Tips
  1. Store a[global_i] * b[global_i] in shared[local_i]
  2. Call barrier() to synchronize
  3. Use thread 0 to sum all products in shared memory
  4. Write final sum to out[0]

Running the code

To test your solution, run the following command in your terminal:

uv run poe p10
pixi run p10

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

Solution

fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    barrier()

    # The following causes race condition: all threads writing to the same location
    # out[0] += shared[local_i]

    # Instead, we can do a parallel reduction in shared memory, as opposed to
    # global memory which has no guarantee on synchronization.
    # Loops using global memory can cause thread divergence because
    # fundamentally GPUs execute threads in warps (groups of 32 threads typically)
    # and warps can be scheduled independently.
    # However, shared memory does not have such issues as long as we use `barrier()`
    # correctly when we're in the same thread block.
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # only thread 0 writes the final result
    if local_i == 0:
        out[0] = shared[0]


The solution implements a parallel reduction algorithm for dot product computation using shared memory. Here’s a detailed breakdown:

Phase 1: Element-wise Multiplication

Each thread performs one multiplication:

Thread i: shared[i] = a[i] * b[i]

Phase 2: Parallel Reduction

The reduction uses a tree-based approach that halves active threads in each step:

Initial:  [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]

Key Implementation Features:

  1. Memory Access Pattern:

    • Each thread loads exactly two values from global memory (a[i], b[i])
    • Uses shared memory for intermediate results
    • Final result written once to global memory
  2. Thread Synchronization:

    • barrier() after initial multiplication
    • barrier() after each reduction step
    • Prevents race conditions between reduction steps
  3. Reduction Logic:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
    • Halves stride in each step
    • Only active threads perform additions
    • Maintains work efficiency
  4. Performance Considerations:

    • \(\log_2(n)\) steps for \(n\) elements
    • Coalesced memory access pattern
    • Minimal thread divergence
    • Efficient use of shared memory

This implementation achieves \(O(\log n)\) time complexity compared to \(O(n)\) in sequential execution, demonstrating the power of parallel reduction algorithms.
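
The reduction trace above can be reproduced on the host with a short Python sketch (illustrative only) that mirrors the kernel's stride-halving loop:

# Python sketch: tree reduction over the element-wise products.
TPB = 8
a = [float(i) for i in range(TPB)]
b = [float(i) for i in range(TPB)]

shared = [a[i] * b[i] for i in range(TPB)]   # [0, 1, 4, 9, 16, 25, 36, 49]

stride = TPB // 2
while stride > 0:
    # Only the first `stride` "threads" are active in each step.
    for local_i in range(stride):
        shared[local_i] += shared[local_i + stride]
    # barrier() would go here on the GPU before the next halving.
    stride //= 2

print(shared[0])  # 140.0, i.e. sum(x * y for x, y in zip(a, b))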

Barrier Synchronization Importance

The barrier() between reduction steps is critical for correctness. Here’s why:

Without barrier(), race conditions occur:

Initial shared memory: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49

Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 1 starts next step (stride = 2) before Thread 0 finishes
  and reads old value shared[0] = 0 instead of 16!

With barrier():

Step 1 (stride = 4):
All threads write their sums:
[16 26 40 58 16 25 36 49]
barrier() ensures ALL threads see these values

Step 2 (stride = 2):
Now threads safely read the updated values:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

The barrier() ensures:

  1. All writes from current step complete
  2. All threads see updated values
  3. No thread starts next iteration early
  4. Consistent shared memory state

Without these synchronization points, we could get:

  • Memory race conditions
  • Threads reading stale values
  • Non-deterministic results
  • Incorrect final sum

Overview

Implement a kernel that computes the dot-product of 1D LayoutTensor a and 1D LayoutTensor b and stores it in 1D LayoutTensor out (single number).

Note: You have 1 thread per position. You only need 2 global reads per thread and 1 global write per thread block.

Key concepts

In this puzzle, you’ll learn about:

  • Similar to puzzle 8 and puzzle 9, implementing parallel reduction with LayoutTensor
  • Managing shared memory using LayoutTensorBuild
  • Coordinating threads for collective operations
  • Using layout-aware tensor operations

The key insight is how LayoutTensor simplifies memory management while maintaining efficient parallel reduction patterns.

Configuration

  • Vector size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Output size: 1 element
  • Shared memory: TPB elements

Notes:

  • Tensor builder: Use LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
  • Element access: Natural indexing with bounds checking
  • Layout handling: Separate layouts for input and output
  • Thread coordination: Same synchronization patterns with barrier()

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (SIZE, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(1)


fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, in_layout],
    b: LayoutTensor[mut=True, dtype, in_layout],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...


View full file: problems/p10/p10_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Store a[global_i] * b[global_i] in shared[local_i]
  3. Use parallel reduction pattern with barrier()
  4. Let thread 0 write final result to out[0]

Running the code

To test your solution, run the following command in your terminal:

uv run poe p10_layout_tensor
pixi run p10_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

Solution

fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, in_layout],
    b: LayoutTensor[mut=True, dtype, in_layout],
    size: Int,
):
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Compute element-wise multiplication into shared memory
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    # Synchronize threads within block
    barrier()

    # Parallel reduction in shared memory
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # Only thread 0 writes the final result
    if local_i == 0:
        out[0] = shared[0]


The solution implements a parallel reduction for dot product using LayoutTensor. Here’s the detailed breakdown:

Phase 1: Element-wise Multiplication

Each thread performs one multiplication with natural indexing:

shared[local_i] = a[global_i] * b[global_i]

Phase 2: Parallel Reduction

Tree-based reduction with layout-aware operations:

Initial:  [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]

Key Implementation Features:

  1. Memory Management:

    • Clean shared memory allocation with tensor builder
    • Type-safe operations with LayoutTensor
    • Automatic bounds checking
    • Layout-aware indexing
  2. Thread Synchronization:

    • barrier() after initial multiplication
    • barrier() between reduction steps
    • Safe thread coordination
  3. Reduction Logic:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
  4. Performance Benefits:

    • \(O(\log n)\) time complexity
    • Coalesced memory access
    • Minimal thread divergence
    • Efficient shared memory usage

The LayoutTensor version maintains the same efficient parallel reduction while providing:

  • Better type safety
  • Cleaner memory management
  • Layout awareness
  • Natural indexing syntax

Barrier Synchronization Importance

The barrier() between reduction steps is critical for correctness. Here’s why:

Without barrier(), race conditions occur:

Initial shared memory: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49

Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 1 starts next step (stride = 2) before Thread 0 finishes
  and reads old value shared[0] = 0 instead of 16!

With barrier():

Step 1 (stride = 4):
All threads write their sums:
[16 26 40 58 16 25 36 49]
barrier() ensures ALL threads see these values

Step 2 (stride = 2):
Now threads safely read the updated values:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

The barrier() ensures:

  1. All writes from current step complete
  2. All threads see updated values
  3. No thread starts next iteration early
  4. Consistent shared memory state

Without these synchronization points, we could get:

  • Memory race conditions
  • Threads reading stale values
  • Non-deterministic results
  • Incorrect final sum

Puzzle 11: 1D Convolution

Moving to LayoutTensor

So far in our GPU puzzle journey, we’ve been exploring two parallel approaches to GPU memory management:

  1. Raw memory management with direct pointer manipulation using UnsafePointer
  2. The more structured LayoutTensor and its related abstractions such as LayoutTensorBuild

Starting from this puzzle, we’re transitioning exclusively to using LayoutTensor. This abstraction provides several benefits:

  • Type-safe memory access patterns
  • Clear representation of data layouts
  • Better code maintainability
  • Reduced chance of memory-related bugs
  • More expressive code that better represents the underlying computations
  • A lot more … that we’ll uncover gradually!

This transition aligns with best practices in modern GPU programming in Mojo 🔥, where higher-level abstractions help manage complexity without sacrificing performance.

Overview

In signal processing and image analysis, convolution is a fundamental operation that combines two sequences to produce a third sequence. This puzzle challenges you to implement a 1D convolution on the GPU, where each output element is computed by sliding a kernel over an input array.

Implement a kernel that computes a 1D convolution between vector a and vector b and stores it in out using the LayoutTensor abstraction.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.

1D Convolution

For those new to convolution, think of it as a weighted sliding window operation. At each position, we multiply the kernel values with the corresponding input values and sum the results. In mathematical notation, this is often written as:

\[\Large out[i] = \sum_{j=0}^{\text{CONV}-1} a[i+j] \cdot b[j] \]

In pseudocode, 1D convolution is:

for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:
            ret[i] += a_host[i + j] * b_host[j]
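
Plugging in the sizes used by the simple case below (SIZE = 6, CONV = 3, with a = [0, 1, ..., 5] and b = [0, 1, 2]), the pseudocode becomes a runnable Python reference:

# Python reference for the 1D convolution (simple case values).
SIZE = 6
CONV = 3
a_host = [float(i) for i in range(SIZE)]   # [0, 1, 2, 3, 4, 5]
b_host = [float(j) for j in range(CONV)]   # [0, 1, 2]

ret = [0.0] * SIZE
for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:
            ret[i] += a_host[i + j] * b_host[j]

print(ret)  # [5.0, 8.0, 11.0, 14.0, 5.0, 0.0]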

This puzzle is split into two parts to help you build understanding progressively:

  • Simple Version with Single Block Start here to learn the basics of implementing convolution with shared memory in a single block using LayoutTensor.

  • Block Boundary Version Then tackle the more challenging case where data needs to be shared across block boundaries, leveraging LayoutTensor’s capabilities.

Each version presents unique challenges in terms of memory access patterns and thread coordination. The simple version helps you understand the basic convolution operation, while the complete version tests your ability to handle more complex scenarios that arise in real-world GPU programming.

Simple Case with Single Block

Implement a kernel that computes a 1D convolution between 1D LayoutTensor a and 1D LayoutTensor b and stores it in 1D LayoutTensor out.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.

Key concepts

In this puzzle, you’ll learn about:

  • Implementing sliding window operations on GPUs
  • Managing data dependencies across threads
  • Using shared memory for overlapping regions

The key insight is understanding how to efficiently access overlapping elements while maintaining correct boundary conditions.

Configuration

  • Input array size: SIZE = 6 elements
  • Kernel size: CONV = 3 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Shared memory: Two arrays of size SIZE and CONV

Notes:

  • Data loading: Each thread loads one element from input and kernel
  • Memory pattern: Shared arrays for input and convolution kernel
  • Thread sync: Coordination before computation

Code to complete

alias TPB = 8
alias SIZE = 6
alias CONV = 3
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(SIZE)
alias conv_layout = Layout.row_major(CONV)


fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 14 lines)


View full file: problems/p11/p11.mojo

Tips
  1. Use tb[dtype]().row_major[SIZE]().shared().alloc() for shared memory allocation
  2. Load input to shared_a[local_i] and kernel to shared_b[local_i]
  3. Call barrier() after loading
  4. Sum products within bounds: if local_i + j < SIZE
  5. Write result if global_i < SIZE

Running the code

To test your solution, run the following command in your terminal:

uv run poe p11 --simple
pixi run p11 --simple

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])

Solution

fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[SIZE]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV]().shared().alloc()
    if global_i < SIZE:
        shared_a[local_i] = a[global_i]

    if global_i < CONV:
        shared_b[local_i] = b[global_i]

    barrier()

    # Note: this is unsafe because there is no guard, so it could access `shared_a` beyond its bounds
    # local_sum = Scalar[dtype](0)
    # for j in range(CONV):
    #     if local_i + j < SIZE:
    #         local_sum += shared_a[local_i + j] * shared_b[j]

    # if global_i < SIZE:
    #     out[global_i] = local_sum

    # Safe and correct:
    if global_i < SIZE:
        # Note: using `var` allows us to include the type in the type inference
        # `out.element_type` is available in LayoutTensor
        var local_sum: out.element_type = 0

        # Note: `@parameter` decorator unrolls the loop at compile time given `CONV` is a compile-time constant
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        @parameter
        for j in range(CONV):
            # Bonus: do we need this check for this specific example with fixed SIZE, CONV
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The solution implements a 1D convolution using shared memory for efficient access to overlapping elements. Here’s a detailed breakdown:

Memory Layout

Input array a:   [0  1  2  3  4  5]
Kernel b:        [0  1  2]

Computation Steps

  1. Data Loading:

    shared_a: [0  1  2  3  4  5]  // Input array
    shared_b: [0  1  2]           // Convolution kernel
    
  2. Convolution Process for each position i:

    out[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
    out[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
    out[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
    out[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
    out[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
    out[5] = a[5]*b[0] + 0*b[1]   + 0*b[2]     = 5*0 + 0*1 + 0*2 = 0
    

Implementation Details

  1. Memory Safety Considerations:

    • The naive approach without proper bounds checking could be unsafe:

      # Unsafe version - could access shared_a beyond its bounds
      local_sum = Scalar[dtype](0)
      for j in range(CONV):
          if local_i + j < SIZE:
              local_sum += shared_a[local_i + j] * shared_b[j]
      
    • The safe and correct implementation:

      if global_i < SIZE:
          var local_sum: out.element_type = 0  # Using var allows type inference
          @parameter  # Unrolls loop at compile time since CONV is constant
          for j in range(CONV):
              if local_i + j < SIZE:
                  local_sum += shared_a[local_i + j] * shared_b[j]
      
  2. Key Implementation Features:

    • Uses var for proper type inference with out.element_type
    • Employs @parameter decorator to unroll the convolution loop at compile time
    • Maintains strict bounds checking for memory safety
    • Leverages LayoutTensor’s type system for better code safety
  3. Memory Management:

    • Uses shared memory for both input array and kernel
    • Single load per thread from global memory
    • Efficient reuse of loaded data
  4. Thread Coordination:

    • barrier() ensures all data is loaded before computation
    • Each thread computes one output element
    • Maintains coalesced memory access pattern
  5. Performance Optimizations:

    • Minimizes global memory access
    • Uses shared memory for fast data access
    • Avoids thread divergence in main computation loop
    • Loop unrolling through @parameter decorator

Block Boundary Version

Implement a kernel that computes a 1D convolution between 1D LayoutTensor a and 1D LayoutTensor b and stores it in 1D LayoutTensor out.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.

Configuration

  • Input array size: SIZE_2 = 15 elements
  • Kernel size: CONV_2 = 4 elements
  • Threads per block: TPB = 8
  • Number of blocks: 2
  • Shared memory: TPB + CONV_2 - 1 elements for input

Notes:

  • Extended loading: Account for boundary overlap
  • Block edges: Handle data across block boundaries
  • Memory layout: Efficient shared memory usage
  • Synchronization: Proper thread coordination

Code to complete

alias SIZE_2 = 15
alias CONV_2 = 4
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)
alias in_2_layout = Layout.row_major(SIZE_2)
alias out_2_layout = Layout.row_major(SIZE_2)
alias conv_2_layout = Layout.row_major(CONV_2)


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)


View full file: problems/p11/p11.mojo

Tips
  1. Use tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc() for shared memory
  2. Load main data: shared_a[local_i] = a[global_i]
  3. Load boundary: if local_i < CONV_2 - 1 handle next block data
  4. Load kernel: shared_b[local_i] = b[local_i]
  5. Sum within extended bounds: if local_i + j < TPB + CONV_2 - 1

Running the code

To test your solution, run the following command in your terminal:

uv run poe p11 --block-boundary
pixi run p11 --block-boundary

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])

Solution

fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        # Zero-fill slots past the end of `a` so the boundary windows below
        # never read uninitialized shared memory (e.g. slot 7 of block 1)
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is an undefined behavior
            shared_a[TPB + local_i] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: out.element_type = 0

        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:

Memory layout and sizing

Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements

Block 0 shared memory:  [0 1 2 3 4 5 6 7|8 9 10]  // TPB(8) + (CONV_2-1)(3) padding
Block 1 shared memory:  [8 9 10 11 12 13 14 0|0 0 0]  // Second block, zero-filled past the end

Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements
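
The sizing above can be cross-checked with a small host-side Python sketch (illustrative only; the variable names are made up here) that lists which global indices each block places in its main region and which land in the halo/padding region:

# Python sketch: per-block shared-memory loading plan for the boundary case.
TPB = 8
SIZE_2 = 15
CONV_2 = 4
NUM_BLOCKS = 2
SHARED_LEN = TPB + CONV_2 - 1                 # 11 slots per block

for block_idx in range(NUM_BLOCKS):
    # Main region: one element per thread, guarded by global_i < SIZE_2.
    main = [TPB * block_idx + t for t in range(TPB)
            if TPB * block_idx + t < SIZE_2]
    # Halo region: the first CONV_2 - 1 threads also load from the next block.
    halo = []
    for local_i in range(CONV_2 - 1):
        next_idx = TPB * block_idx + local_i + TPB
        halo.append(next_idx if next_idx < SIZE_2 else None)   # None -> zero-filled
    print("block", block_idx, "slots:", SHARED_LEN, "main:", main, "halo:", halo)

# block 0 slots: 11 main: [0, 1, 2, 3, 4, 5, 6, 7]   halo: [8, 9, 10]
# block 1 slots: 11 main: [8, 9, 10, 11, 12, 13, 14] halo: [None, None, None] (zero-filled)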

Implementation details

  1. Shared Memory Allocation:

    # First: account for padding needed for convolution window
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    

    This allocation pattern ensures we have enough space for both the block’s data and the overlap region.

  2. Data Loading Strategy:

    # Main block data (zero-fill slots that fall past the end of the array)
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0

    # Boundary data from next block
    if local_i < CONV_2 - 1:
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is an undefined behavior
            shared_a[TPB + local_i] = 0
    
    • Only threads with local_i < CONV_2 - 1 load boundary data
    • Prevents unnecessary thread divergence
    • Maintains memory coalescing for main data load
    • Explicitly zeroes out-of-bounds elements to avoid undefined behavior
  3. Kernel Loading:

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]
    
    • Single load per thread
    • Bounded by kernel size
  4. Convolution Computation:

    if global_i < SIZE_2:
        var local_sum: out.element_type = 0
        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]
    
    • Uses @parameter for compile-time loop unrolling
    • Proper type inference with out.element_type
    • Extended bounds check for overlap region

Memory access pattern analysis

  1. Block 0 Access Pattern:

    Thread 0: [0 1 2 3] × [0 1 2 3]
    Thread 1: [1 2 3 4] × [0 1 2 3]
    Thread 2: [2 3 4 5] × [0 1 2 3]
    ...
    Thread 7: [7 8 9 10] × [0 1 2 3]  // Uses overlap data
    
  2. Block 1 Access Pattern:

    Thread 0: [8 9 10 11] × [0 1 2 3]
    Thread 1: [9 10 11 12] × [0 1 2 3]
    ...
    Thread 6: [14 0 0 0] × [0 1 2 3]  // Zero padding at end
    Thread 7: inactive (global_i = 15 is out of bounds, no output written)
    

Performance optimizations

  1. Memory Coalescing:

    • Main data load: Consecutive threads access consecutive memory
    • Boundary data: Only necessary threads participate
    • Single barrier synchronization point
  2. Thread Divergence Minimization:

    • Clean separation of main and boundary loading
    • Uniform computation pattern within warps
    • Efficient bounds checking
  3. Shared Memory Usage:

    • Optimal sizing to handle block boundaries
    • No bank conflicts in access pattern
    • Efficient reuse of loaded data
  4. Boundary Handling:

    • Explicit zero initialization for out-of-bounds elements which prevents reading from uninitialized shared memory
    • Proper handling of edge cases

This implementation achieves efficient cross-block convolution while maintaining:

  • Memory safety through proper bounds checking
  • High performance through optimized memory access
  • Clean code structure using LayoutTensor abstractions
  • Minimal synchronization overhead

Puzzle 12: Prefix Sum

Overview

Prefix sum (also known as scan) is a fundamental parallel algorithm that computes running totals of a sequence. Found at the heart of many parallel applications - from sorting algorithms to scientific simulations - it transforms a sequence of numbers into their running totals. While simple to compute sequentially, making this efficient on a GPU requires clever parallel thinking!

Implement a kernel that computes a prefix-sum over 1D LayoutTensor a and stores it in 1D LayoutTensor out.

Note: If the size of a is greater than the block size, only store the sum of each block.

Prefix sum

Key concepts

In this puzzle, you’ll learn about:

  • Parallel algorithms with logarithmic complexity
  • Shared memory coordination patterns
  • Multi-phase computation strategies

The key insight is understanding how to transform a sequential operation into an efficient parallel algorithm using shared memory.

For example, given an input sequence \([3, 1, 4, 1, 5, 9]\), the prefix sum would produce:

  • \([3]\) (just the first element)
  • \([3, 4]\) (3 + 1)
  • \([3, 4, 8]\) (previous sum + 4)
  • \([3, 4, 8, 9]\) (previous sum + 1)
  • \([3, 4, 8, 9, 14]\) (previous sum + 5)
  • \([3, 4, 8, 9, 14, 23]\) (previous sum + 9)

Mathematically, for a sequence \([x_0, x_1, …, x_n]\), the prefix sum produces: \[ [x_0, x_0+x_1, x_0+x_1+x_2, …, \sum_{i=0}^n x_i] \]
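
For reference, Python's itertools.accumulate computes exactly this inclusive prefix sum on the host:

# Sequential (CPU) inclusive prefix sum of the example sequence.
from itertools import accumulate

xs = [3, 1, 4, 1, 5, 9]
print(list(accumulate(xs)))  # [3, 4, 8, 9, 14, 23]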

While a sequential algorithm would need \(O(n)\) steps, our parallel approach uses a clever two-phase algorithm that completes in \(O(\log n)\) steps!

This puzzle is split into two parts to help you master the concept:

  • Simple Version Start with a single block implementation where all data fits in shared memory. This helps understand the core parallel algorithm.

  • Complete Version Then tackle the more challenging case of handling larger arrays that span multiple blocks, requiring coordination between blocks.

Each version builds on the previous one, helping you develop a deep understanding of parallel prefix sum computation. The simple version establishes the fundamental algorithm, while the complete version shows how to scale it to larger datasets - a common requirement in real-world GPU applications.

Simple Version

Implement a kernel that computes a prefix-sum over 1D LayoutTensor a and stores it in 1D LayoutTensor out.

Note: If the size of a is greater than the block size, only store the sum of each block.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Shared memory: TPB elements

Notes:

  • Data loading: Each thread loads one element using LayoutTensor access
  • Memory pattern: Shared memory for intermediate results using LayoutTensorBuild
  • Thread sync: Coordination between computation phases
  • Access pattern: Stride-based parallel computation
  • Type safety: Leveraging LayoutTensor’s type system

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn prefix_sum_simple[
    layout: Layout
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 12 lines)


View full file: problems/p12/p12.mojo

Tips
  1. Load data into shared[local_i]
  2. Use offset = 1 and double it each step
  3. Add elements where local_i >= offset
  4. Call barrier() between steps

Running the code

To test your solution, run the following command in your terminal:

uv run poe p12 --simple
pixi run p12 --simple

Your output will look like this if the puzzle isn’t solved yet:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0])

Solution

fn prefix_sum_simple[
    layout: Layout
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    offset = 1
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        if local_i >= offset and local_i < size:
            shared[local_i] += shared[local_i - offset]

        barrier()
        offset *= 2

    if global_i < size:
        out[global_i] = shared[local_i]


The parallel (inclusive) prefix-sum algorithm works as follows:

Setup & Configuration

  • TPB (Threads Per Block) = 8
  • SIZE (Array Size) = 8

Thread Mapping

  • thread_idx.x: \([0, 1, 2, 3, 4, 5, 6, 7]\) (local_i)
  • block_idx.x: \([0, 0, 0, 0, 0, 0, 0, 0]\)
  • global_i: \([0, 1, 2, 3, 4, 5, 6, 7]\) (block_idx.x * TPB + thread_idx.x)

Initial Load to Shared Memory

Threads:      T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇
Input array:  [0    1    2    3    4    5    6    7]
shared:       [0    1    2    3    4    5    6    7]
               ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑
              T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇

Offset = 1: First Parallel Step

Active threads: \(T_1 \ldots T_7\) (where local_i ≥ 1)

Before:      [0    1    2    3    4    5    6    7]
Add:              +0   +1   +2   +3   +4   +5   +6
                   |    |    |    |    |    |    |
Result:      [0    1    3    5    7    9    11   13]
                   ↑    ↑    ↑    ↑    ↑    ↑    ↑
                  T₁   T₂   T₃   T₄   T₅   T₆   T₇

Offset = 2: Second Parallel Step

Active threads: \(T_2 \ldots T_7\) (where local_i ≥ 2)

Before:      [0    1    3    5    7    9    11   13]
Add:                   +0   +1   +3   +5   +7   +9
                        |    |    |    |    |    |
Result:      [0    1    3    6    10   14   18   22]
                        ↑    ↑    ↑    ↑    ↑    ↑
                       T₂   T₃   T₄   T₅   T₆   T₇

Offset = 4: Third Parallel Step

Active threads: \(T_4 \ldots T_7\) (where local_i ≥ 4)

Before:      [0    1    3    6    10   14   18   22]
Add:                              +0   +1   +3   +7
                                  |    |    |    |
Result:      [0    1    3    6    10   15   21   28]
                                  ↑    ↑    ↑    ↑
                                  T₄   T₅   T₆   T₇

Final Write to Output

Threads:      T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇
global_i:     0    1    2    3    4    5    6    7
out[]:       [0    1    3    6    10   15   21   28]
              ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑
              T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇

Thread-by-Thread Execution

\(T_0\) (local_i=0):

  • Loads shared[0] = 0
  • Never adds (local_i < offset always)
  • Writes out[0] = 0

\(T_1\) (local_i=1):

  • Loads shared[1] = 1
  • offset=1: adds shared[0] → 1
  • offset=2,4: no action (local_i < offset)
  • Writes out[1] = 1

\(T_2\) (local_i=2):

  • Loads shared[2] = 2
  • offset=1: adds shared[1] → 3
  • offset=2: adds shared[0] → 3
  • offset=4: no action
  • Writes out[2] = 3

\(T_3\) (local_i=3):

  • Loads shared[3] = 3
  • offset=1: adds shared[2] → 5
  • offset=2: adds shared[1] → 6
  • offset=4: no action
  • Writes out[3] = 6

\(T_4\) (local_i=4):

  • Loads shared[4] = 4
  • offset=1: adds shared[3] → 7
  • offset=2: adds shared[2] → 10
  • offset=4: adds shared[0] → 10
  • Writes out[4] = 10

\(T_5\) (local_i=5):

  • Loads shared[5] = 5
  • offset=1: adds shared[4] → 9
  • offset=2: adds shared[3] → 14
  • offset=4: adds shared[1] → 15
  • Writes out[5] = 15

\(T_6\) (local_i=6):

  • Loads shared[6] = 6
  • offset=1: adds shared[5] → 11
  • offset=2: adds shared[4] → 18
  • offset=4: adds shared[2] → 21
  • Writes out[6] = 21

\(T_7\) (local_i=7):

  • Loads shared[7] = 7
  • offset=1: adds shared[6] → 13
  • offset=2: adds shared[5] → 22
  • offset=4: adds shared[3] → 28
  • Writes out[7] = 28

The solution ensures correct synchronization between phases using barrier() and handles array bounds checking with if global_i < size. The final result produces the inclusive prefix sum where each element \(i\) contains \(\sum_{j=0}^{i} a[j]\).
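
The same doubling-offset pattern can be simulated on the host. The Python sketch below (illustrative only) takes a snapshot of the array before each step, which plays the role of the per-step synchronization:

# Python sketch: single-block inclusive scan with doubling offsets.
TPB = 8
shared = [float(i) for i in range(TPB)]       # values loaded into shared memory

offset = 1
while offset < TPB:                           # log2(TPB) = 3 steps
    prev = shared[:]                          # snapshot = "all threads see the last step"
    for local_i in range(TPB):
        if local_i >= offset:
            shared[local_i] = prev[local_i] + prev[local_i - offset]
    offset *= 2

print(shared)  # [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0]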

Complete Version

Implement a kernel that computes a prefix-sum over 1D LayoutTensor a and stores it in 1D LayoutTensor out.

Note: If the size of a is greater than the block size, we need to synchronize across multiple blocks to get the correct result.

Configuration

  • Array size: SIZE_2 = 15 elements
  • Threads per block: TPB = 8
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Notes:

  • Multiple blocks: When the input array is larger than one block, we need a multi-phase approach
  • Block-level sync: Within a block, use barrier() to synchronize threads
  • Host-level sync: Between blocks, use ctx.synchronize() at the host level
  • Auxiliary storage: Use extra space to store block sums for cross-block communication

Code to complete

You need to complete two separate kernel functions for the multi-block prefix sum:

  1. First kernel (prefix_sum_local_phase): Computes local prefix sums within each block and stores block sums
  2. Second kernel (prefix_sum_block_sum_phase): Adds previous block sums to elements in subsequent blocks

The main function will handle the necessary host-side synchronization between these kernels.

alias SIZE_2 = 15
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)
alias EXTENDED_SIZE = SIZE_2 + 2  # up to 2 blocks
alias extended_layout = Layout.row_major(EXTENDED_SIZE)


# Kernel 1: Compute local prefix sums and store block sums in out
fn prefix_sum_local_phase[
    out_layout: Layout, in_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
    num_blocks: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 14 lines)


# Kernel 2: Add block sums to their respective blocks
fn prefix_sum_block_sum_phase[
    layout: Layout
](out: LayoutTensor[mut=False, dtype, layout], size: Int):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 3 lines)


View full file: problems/p12/p12.mojo

The key to this puzzle is understanding that barrier only synchronizes threads within a block, not across blocks. For cross-block synchronization, you need to use host-level synchronization:

            # Phase 1: Local prefix sums
            ctx.enqueue_function[
                prefix_sum_local_phase[extended_layout, extended_layout]
            ](
                out_tensor,
                a_tensor,
                size,
                num_blocks,
                grid_dim=BLOCKS_PER_GRID_2,
                block_dim=THREADS_PER_BLOCK_2,
            )

            # Wait for all blocks to complete by synchronizing on the host with `ctx.synchronize()`
            # Note this is in contrast with using `barrier()` in the kernel
            # which is a synchronization point for all threads in the same block and not across blocks.
            ctx.synchronize()

            # Phase 2: Add block sums
            ctx.enqueue_function[prefix_sum_block_sum_phase[extended_layout]](
                out_tensor,
                size,
                grid_dim=BLOCKS_PER_GRID_2,
                block_dim=THREADS_PER_BLOCK_2,
            )
Tips

1. Build on the simple prefix sum

The Simple Version shows how to implement a single-block prefix sum. You’ll need to extend that approach to work across multiple blocks:

Simple version (single block): [0,1,2,3,4,5,6,7] → [0,1,3,6,10,15,21,28]

Complete version (two blocks):
Block 0: [0,1,2,3,4,5,6,7] → [0,1,3,6,10,15,21,28]
Block 1: [8,9,10,11,12,13,14] → [8,17,27,38,50,63,77]

But how do we handle the second block’s values? They need to include sums from the first block!

2. Two-phase approach

The simple prefix sum can’t synchronize across blocks, so split the work:

  1. First phase: Each block computes its own local prefix sum (just like the simple version)
  2. Second phase: Blocks incorporate the sums from previous blocks

Remember: barrier() only synchronizes threads within one block. You need host-level synchronization between phases.

3. Extended memory strategy

Since blocks can’t directly communicate, you need somewhere to store block sums:

  • Allocate extra memory at the end of your output buffer
  • Last thread in each block stores its final sum in this extra space
  • Subsequent blocks can read these sums and add them to their elements

4. Key implementation insights

  • Different layouts: Input and output may have different shapes
  • Boundary handling: Always check global_i < size for array bounds
  • Thread role specialization: Only specific threads (e.g., last thread) should store block sums
  • Two kernel synchronization: Use ctx.synchronize() between kernel launches

5. Debugging Strategy

If you encounter issues, try visualizing the intermediate state after the first phase:

After first phase: [0,1,3,6,10,15,21,28, 8,17,27,38,50,63,77, ???,???]

Where ??? should contain your block sums that will be used in the second phase.
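
Putting these tips together, the two phases can be simulated on the host with a short Python sketch (illustrative only; plain lists stand in for the extended buffer and the `block_sums` name is made up here):

# Python sketch: two-phase multi-block prefix sum for SIZE_2 = 15, TPB = 8.
from itertools import accumulate

TPB = 8
SIZE_2 = 15
NUM_BLOCKS = 2

a = list(range(SIZE_2))                 # [0, 1, ..., 14]
out = [0] * SIZE_2
block_sums = [0] * NUM_BLOCKS           # stands in for the extra buffer slots

# Phase 1: each block computes a local prefix sum and records its total.
for b in range(NUM_BLOCKS):
    chunk = a[b * TPB:(b + 1) * TPB]
    scanned = list(accumulate(chunk))
    out[b * TPB:b * TPB + len(scanned)] = scanned
    block_sums[b] = scanned[-1]

print(out + block_sums)
# [0, 1, 3, 6, 10, 15, 21, 28, 8, 17, 27, 38, 50, 63, 77, 28, 77]

# Phase 2: blocks after the first add the previous block's sum.
for b in range(1, NUM_BLOCKS):
    for i in range(b * TPB, min((b + 1) * TPB, SIZE_2)):
        out[i] += block_sums[b - 1]

print(out)
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105]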

Running the code

To test your solution, run the following command in your terminal:

uv run poe p12 --complete
pixi run p12 --complete

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0])

Solution



# Kernel 1: Compute local prefix sums and store block sums in out
fn prefix_sum_local_phase[
    out_layout: Layout, in_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
    num_blocks: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    # Load data into shared memory
    # Example with SIZE_2=15, TPB=8, BLOCKS=2:
    # Block 0 shared mem: [0,1,2,3,4,5,6,7]
    # Block 1 shared mem: [8,9,10,11,12,13,14,0]  (last value padded with 0)
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # Compute local prefix sum using parallel reduction
    # This uses a tree-based algorithm with log(TPB) iterations
    # Iteration 1 (offset=1):
    #   Block 0: [0,0+1,2+1,3+2,4+3,5+4,6+5,7+6] = [0,1,3,5,7,9,11,13]
    # Iteration 2 (offset=2):
    #   Block 0: [0,1,3+0,5+1,7+3,9+5,11+7,13+9] = [0,1,3,6,10,14,18,22]
    # Iteration 3 (offset=4):
    #   Block 0: [0,1,3,6,10+0,14+1,18+3,22+6] = [0,1,3,6,10,15,21,28]
    # Block 1 follows same pattern to get [8,17,27,38,50,63,77,...]
    offset = 1
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        if local_i >= offset and local_i < TPB:
            shared[local_i] += shared[local_i - offset]
        barrier()
        offset *= 2

    # Write local results to output
    # Block 0 writes: [0,1,3,6,10,15,21,28]
    # Block 1 writes: [8,17,27,38,50,63,77,...]
    if global_i < size:
        out[global_i] = shared[local_i]

    # Store block sums in auxiliary space
    # Block 0: Thread 7 stores 28 at position size+0 (position 15)
    # Block 1: Thread 7 stores 77 at position size+1 (position 16)
    # This gives us: [0,1,3,6,10,15,21,28, 8,17,27,38,50,63,77, 28,77]
    #                                                           ↑  ↑
    #                                                     Block sums here
    if local_i == TPB - 1:
        out[size + block_idx.x] = shared[local_i]


# Kernel 2: Add block sums to their respective blocks
fn prefix_sum_block_sum_phase[
    layout: Layout
](out: LayoutTensor[mut=False, dtype, layout], size: Int):
    global_i = block_dim.x * block_idx.x + thread_idx.x

    # Second pass: add previous block's sum to each element
    # Block 0: No change needed - already correct
    # Block 1: Add Block 0's sum (28) to each element
    #   Before: [8,17,27,38,50,63,77]
    #   After: [36,45,55,66,78,91,105]
    # Final result combines both blocks:
    # [0,1,3,6,10,15,21,28, 36,45,55,66,78,91,105]
    if block_idx.x > 0 and global_i < size:
        prev_block_sum = out[size + block_idx.x - 1]
        out[global_i] += prev_block_sum


This solution implements a multi-block prefix sum using a two-kernel approach to handle an array that spans multiple thread blocks. Let’s break down each aspect in detail:

The challenge of cross-block communication

The fundamental limitation in GPU programming is that threads can only synchronize within a block using barrier(). When data spans multiple blocks, we face the challenge: How do we ensure blocks can communicate their partial results to other blocks?

Memory layout visualization

For our test case with SIZE_2 = 15 and TPB = 8:

Input array:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

Block 0 processes: [0, 1, 2, 3, 4, 5, 6, 7]
Block 1 processes: [8, 9, 10, 11, 12, 13, 14, (padding)]

We extend the output buffer to include space for block sums:

Extended buffer: [data values (15 elements)] + [block sums (2 elements)]
                 [0...14] + [block0_sum, block1_sum]

The size of this extended buffer is: EXTENDED_SIZE = SIZE_2 + num_blocks = 15 + 2 = 17

Phase 1 kernel: Local prefix sums

Step-by-step execution for Block 0

  1. Load values into shared memory:

    shared = [0, 1, 2, 3, 4, 5, 6, 7]
    
  2. Iterations of parallel reduction (\(\log_2(TPB) = 3\) iterations):

    Iteration 1 (offset=1):

    shared[0] = 0              (unchanged)
    shared[1] = 1 + 0 = 1
    shared[2] = 2 + 1 = 3
    shared[3] = 3 + 2 = 5
    shared[4] = 4 + 3 = 7
    shared[5] = 5 + 4 = 9
    shared[6] = 6 + 5 = 11
    shared[7] = 7 + 6 = 13
    

    After barrier: shared = [0, 1, 3, 5, 7, 9, 11, 13]

    Iteration 2 (offset=2):

    shared[0] = 0              (unchanged)
    shared[1] = 1              (unchanged)
    shared[2] = 3 + 0 = 3      (unchanged)
    shared[3] = 5 + 1 = 6
    shared[4] = 7 + 3 = 10
    shared[5] = 9 + 5 = 14
    shared[6] = 11 + 7 = 18
    shared[7] = 13 + 9 = 22
    

    After barrier: shared = [0, 1, 3, 6, 10, 14, 18, 22]

    Iteration 3 (offset=4):

    shared[0] = 0              (unchanged)
    shared[1] = 1              (unchanged)
    shared[2] = 3              (unchanged)
    shared[3] = 6              (unchanged)
    shared[4] = 10 + 0 = 10    (unchanged)
    shared[5] = 14 + 1 = 15
    shared[6] = 18 + 3 = 21
    shared[7] = 22 + 6 = 28
    

    After barrier: shared = [0, 1, 3, 6, 10, 15, 21, 28]

  3. Write local results back to global memory:

    out[0...7] = [0, 1, 3, 6, 10, 15, 21, 28]
    
  4. Store block sum in auxiliary space (only last thread):

    out[15] = 28  // at position size + block_idx.x = 15 + 0
    

Step-by-step execution for Block 1

  1. Load values into shared memory:

    shared = [8, 9, 10, 11, 12, 13, 14, 0] // Last value padded with 0
    
  2. Iterations of parallel reduction (\(\log_2(TPB) = 3\) iterations):

    With similar iterations as Block 0, after all three iterations:

    shared = [8, 17, 27, 38, 50, 63, 77, 77]
    
  3. Write local results back to global memory:

    out[8...14] = [8, 17, 27, 38, 50, 63, 77]
    
  4. Store block sum in auxiliary space (only last thread):

    out[16] = 77  // at position size + block_idx.x = 15 + 1
    

After Phase 1, the output buffer contains:

[0, 1, 3, 6, 10, 15, 21, 28, 8, 17, 27, 38, 50, 63, 77, 28, 77]
                                                        ^   ^
                                                Block sums stored here

Host-side synchronization: The critical step

Between phases 1 and 2, we call:

ctx.synchronize()

This is the most crucial part of the algorithm! Without this synchronization, the second kernel might start before the first one completes, leading to race conditions and incorrect results. This is a fundamental difference from single-block algorithms where barrier() would be sufficient.

Phase 2 kernel: Block sum addition

  1. Block 0: No changes needed (it’s already correct).

  2. Block 1: Each thread adds Block 0’s sum to its element:

    prev_block_sum = out[size + block_idx.x - 1] = out[15] = 28
    out[global_i] += prev_block_sum
    

    Block 1 values are transformed:

    Before: [8, 17, 27, 38, 50, 63, 77]
    After:  [36, 45, 55, 66, 78, 91, 105]
    

Performance and optimization considerations

  1. Work efficiency: This implementation performs \(O(n \log n)\) work, while the sequential algorithm performs \(O(n)\) work. Trading extra total work for much lower parallel depth is a classic tradeoff in parallel algorithms.

  2. Memory overhead: The extra space for block sums is minimal (just one element per block).

This two-kernel approach is a fundamental pattern in GPU programming for algorithms that require cross-block communication. The same strategy can be applied to other parallel algorithms like radix sort, histogram calculation, and reduction operations.
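
To make the data flow concrete, here is a small host-side NumPy sketch (an illustration of the algorithm, not the Mojo solution) that mimics the two kernels: phase 1 computes per-block prefix sums and stores each block's total in the auxiliary slots, and phase 2 adds the previous block's sum to later blocks, exactly as the kernel does for the two blocks used in this puzzle:

import numpy as np

SIZE_2, TPB = 15, 8
num_blocks = (SIZE_2 + TPB - 1) // TPB                   # 2 blocks
a = np.arange(SIZE_2, dtype=np.float32)
out = np.zeros(SIZE_2 + num_blocks, dtype=np.float32)    # extended buffer

# Phase 1: per-block local prefix sums, block sum stored in the auxiliary slot
for b in range(num_blocks):
    chunk = a[b * TPB : min((b + 1) * TPB, SIZE_2)]
    out[b * TPB : b * TPB + len(chunk)] = np.cumsum(chunk)
    out[SIZE_2 + b] = chunk.sum()

# Host-side synchronization happens here in the real implementation (ctx.synchronize())

# Phase 2: every block after the first adds the previous block's sum,
# mirroring the kernel (sufficient for the two blocks in this puzzle)
for b in range(1, num_blocks):
    out[b * TPB : min((b + 1) * TPB, SIZE_2)] += out[SIZE_2 + b - 1]

print(out[:SIZE_2])
# [  0.   1.   3.   6.  10.  15.  21.  28.  36.  45.  55.  66.  78.  91. 105.]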

Puzzle 13: Axis Sum

Overview

Implement a kernel that computes a sum over each row of 2D matrix a and stores it in out using LayoutTensor.

Axis Sum visualization

Key concepts

In this puzzle, you’ll learn about:

  • Parallel reduction along matrix dimensions using LayoutTensor
  • Using block coordinates for data partitioning
  • Efficient shared memory reduction patterns
  • Working with multi-dimensional tensor layouts

The key insight is understanding how to map thread blocks to matrix rows and perform efficient parallel reduction within each block while leveraging LayoutTensor’s dimensional indexing.

Configuration

  • Matrix dimensions: \(\text{BATCH} \times \text{SIZE} = 4 \times 6\)
  • Threads per block: \(\text{TPB} = 8\)
  • Grid dimensions: \(1 \times \text{BATCH}\)
  • Shared memory: \(\text{TPB}\) elements per block
  • Input layout: Layout.row_major(BATCH, SIZE)
  • Output layout: Layout.row_major(BATCH, 1)

Matrix visualization:

Row 0: [0, 1, 2, 3, 4, 5]       → Block(0,0)
Row 1: [6, 7, 8, 9, 10, 11]     → Block(0,1)
Row 2: [12, 13, 14, 15, 16, 17] → Block(0,2)
Row 3: [18, 19, 20, 21, 22, 23] → Block(0,3)

Code to Complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias TPB = 8
alias BATCH = 4
alias SIZE = 6
alias BLOCKS_PER_GRID = (1, BATCH)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(BATCH, SIZE)
alias out_layout = Layout.row_major(BATCH, 1)


fn axis_sum[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    batch = block_idx.y
    # FILL ME IN (roughly 15 lines)


View full file: problems/p13/p13.mojo

Tips
  1. Use batch = block_idx.y to select row
  2. Load elements with 2D indexing: cache[local_i] = a[batch, local_i]
  3. Perform parallel reduction with halving stride
  4. Thread 0 writes final sum to out[batch]

Running the Code

To test your solution, run the following command in your terminal:

uv run poe p13
pixi run p13

Your output will look like this if the puzzle isn’t solved yet:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([15.0, 51.0, 87.0, 123.0])

Solution

fn axis_sum[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    batch = block_idx.y
    cache = tb[dtype]().row_major[TPB]().shared().alloc()

    # Visualize:
    # Block(0,0): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 0: [0,1,2,3,4,5]
    # Block(0,1): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 1: [6,7,8,9,10,11]
    # Block(0,2): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 2: [12,13,14,15,16,17]
    # Block(0,3): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 3: [18,19,20,21,22,23]

    # Each row is handled by one block because grid_dim=(1, BATCH)

    if local_i < size:
        cache[local_i] = a[batch, local_i]
    else:
        # Zero-initialize padding elements so they don't affect the reduction
        cache[local_i] = 0

    barrier()

    # Parallel sum reduction within this block
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            cache[local_i] += cache[local_i + stride]

        barrier()
        stride //= 2

    # Thread 0 holds the final sum and writes the result for this batch
    if local_i == 0:
        out[batch, 0] = cache[0]


The solution implements a parallel row-wise sum reduction for a 2D matrix using LayoutTensor. Here’s a comprehensive breakdown:

Matrix Layout and Block Mapping

Input Matrix (4×6) with LayoutTensor:                Block Assignment:
[[ a[0,0]  a[0,1]  a[0,2]  a[0,3]  a[0,4]  a[0,5] ] → Block(0,0)
 [ a[1,0]  a[1,1]  a[1,2]  a[1,3]  a[1,4]  a[1,5] ] → Block(0,1)
 [ a[2,0]  a[2,1]  a[2,2]  a[2,3]  a[2,4]  a[2,5] ] → Block(0,2)
 [ a[3,0]  a[3,1]  a[3,2]  a[3,3]  a[3,4]  a[3,5] ] → Block(0,3)

Parallel Reduction Process

  1. Initial Data Loading:

    Block(0,0): cache = [a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] * *]  // * = padding
    Block(0,1): cache = [a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] * *]
    Block(0,2): cache = [a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] * *]
    Block(0,3): cache = [a[3,0] a[3,1] a[3,2] a[3,3] a[3,4] a[3,5] * *]
    
  2. Reduction Steps (for Block 0,0):

    Initial:  [ 0  1  2  3  4  5  0  0]   (padding zeroed)
    Stride 4: [ 4  6  2  3  4  5  0  0]
    Stride 2: [ 6  9  2  3  4  5  0  0]
    Stride 1: [15  9  2  3  4  5  0  0]
    

Key Implementation Features:

  1. Layout Configuration:

    • Input: row-major layout (BATCH × SIZE)
    • Output: row-major layout (BATCH × 1)
    • Each block processes one complete row
  2. Memory Access Pattern:

    • LayoutTensor 2D indexing for input: a[batch, local_i]
    • Shared memory for efficient reduction
    • LayoutTensor 2D indexing for output: out[batch, 0]
  3. Parallel Reduction Logic:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            cache[local_i] += cache[local_i + stride]
        barrier()
        stride //= 2
    
  4. Output Writing:

    if local_i == 0:
        out[batch, 0] = cache[0]  # one result per batch
    

Performance Optimizations:

  1. Memory Efficiency:

    • Coalesced memory access through LayoutTensor
    • Shared memory for fast reduction
    • Single write per row result
  2. Thread Utilization:

    • Perfect load balancing across rows
    • No thread divergence in main computation
    • Efficient parallel reduction pattern
  3. Synchronization:

    • Minimal barriers (only during reduction)
    • Independent processing between rows
    • No inter-block communication needed

Complexity Analysis:

  • Time: \(O(\log n)\) per row, where n is row length
  • Space: \(O(TPB)\) shared memory per block
  • Total parallel time: \(O(\log n)\) with sufficient threads
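
As a sanity check, the same row-wise reduction can be mimicked on the host. The sketch below is plain NumPy, assuming the test matrix holds the values 0 through 23 shown in the matrix visualization; it reproduces both the stride-halving loop and the expected output:

import numpy as np

BATCH, SIZE, TPB = 4, 6, 8
a = np.arange(BATCH * SIZE, dtype=np.float32).reshape(BATCH, SIZE)

out = np.zeros(BATCH, dtype=np.float32)
for batch in range(BATCH):                 # one "block" per row
    cache = np.zeros(TPB, dtype=np.float32)
    cache[:SIZE] = a[batch]                # padding stays zero
    stride = TPB // 2
    while stride > 0:                      # stride-halving reduction
        for i in range(stride):
            cache[i] += cache[i + stride]
        stride //= 2
    out[batch] = cache[0]

print(out)                  # [ 15.  51.  87. 123.]
print(a.sum(axis=1))        # same result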

Puzzle 14: Matrix Multiplication (MatMul)

Overview

Matrix multiplication is a fundamental operation in scientific computing, machine learning, and graphics. Given two matrices \(A\) and \(B\), we want to compute their product \(C = A \times B.\)

For matrices \(A_{m\times k}\) and \(B_{k\times n}\), each element of the result \(C_{m\times n}\) is computed as:

\[\Large C_{ij} = \sum_{l=0}^{k-1} A_{il} \cdot B_{lj} \]

Matrix Multiply visualization

This puzzle explores different approaches to implementing matrix multiplication on GPUs, each with its own performance characteristics:

  • Naive Version The straightforward implementation where each thread computes one element of the output matrix. While simple to understand, this approach makes many redundant memory accesses.

  • Shared Memory Version Improves performance by loading blocks of input matrices into fast shared memory, reducing global memory accesses. Each thread still computes one output element but reads from shared memory.

  • Tiled Version Further optimizes by dividing the computation into tiles, allowing threads to cooperate on loading and computing blocks of the output matrix. This approach better utilizes memory hierarchy and thread cooperation.

Each version builds upon the previous one, introducing new optimization techniques common in GPU programming. You’ll learn how different memory access patterns and thread cooperation strategies affect performance.

The progression illustrates a common pattern in GPU optimization:

  1. Start with a correct but naive implementation
  2. Reduce global memory access with shared memory
  3. Improve data locality and thread cooperation with tiling
  4. Use high-level abstractions while maintaining performance

Choose a version to begin your matrix multiplication journey!

Naïve Matrix Multiplication

Overview

Implement a kernel that multiplies square matrices \(A\) and \(B\) and stores the result in \(\text{out}\). This is the most straightforward implementation where each thread computes one element of the output matrix.

Key concepts

In this puzzle, you’ll learn about:

  • 2D thread organization for matrix operations
  • Global memory access patterns
  • Matrix indexing in row-major layout
  • Thread-to-output element mapping

The key insight is understanding how to map 2D thread indices to matrix elements and compute dot products in parallel.

Configuration

  • Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(1 \times 1\)

Layout configuration:

  • Input A: Layout.row_major(SIZE, SIZE)
  • Input B: Layout.row_major(SIZE, SIZE)
  • Output: Layout.row_major(SIZE, SIZE)

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias TPB = 3
alias SIZE = 2
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, TPB)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 6 lines)


View full file: problems/p14/p14.mojo

Tips
  1. Calculate row and col from thread indices
  2. Check if indices are within size
  3. Accumulate products in a local variable
  4. Write final sum to correct output position

Running the code

To test your solution, run the following command in your terminal:

uv run poe p14 --naive
pixi run p14 --naive

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])

Solution

fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x

    if row < size and col < size:
        var acc: out.element_type = 0

        @parameter
        for k in range(size):
            acc += a[row, k] * b[k, col]

        out[row, col] = acc


The naive matrix multiplication using LayoutTensor demonstrates the basic approach:

Matrix Layout (2×2 example)

Matrix A:          Matrix B:                   Output C:
[a[0,0] a[0,1]]    [b[0,0] b[0,1]]             [c[0,0] c[0,1]]
[a[1,0] a[1,1]]    [b[1,0] b[1,1]]             [c[1,0] c[1,1]]

Implementation Details:

  1. Thread mapping:

    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    
  2. Memory access pattern:

    • Direct 2D indexing: a[row, k]
    • Column-wise access: b[k, col] (strided in row-major layout)
    • Output writing: out[row, col]
  3. Computation flow:

    # Use var for mutable accumulator with tensor's element type
    var acc: out.element_type = 0
    
    # @parameter for compile-time loop unrolling
    @parameter
    for k in range(size):
        acc += a[row, k] * b[k, col]
    

Key language features:

  1. Variable declaration:

    • var in var acc: out.element_type = 0 declares a mutable accumulator whose type is taken from out.element_type, ensuring type compatibility with the output tensor
    • Initialized to zero before accumulation
  2. Loop optimization:

    • @parameter decorator unrolls the loop at compile time
    • Improves performance for small, known matrix sizes
    • Enables better instruction scheduling

Performance characteristics:

  1. Memory access:

    • Each thread makes 2 x SIZE global memory reads
    • One global memory write per thread
    • No data reuse between threads
  2. Computational efficiency:

    • Simple implementation but suboptimal performance
    • Many redundant global memory accesses
    • No use of fast shared memory
  3. Limitations:

    • High global memory bandwidth usage
    • Poor data locality
    • Limited scalability for large matrices

This naive implementation serves as a baseline for understanding matrix multiplication on GPUs, highlighting the need for optimization in memory access patterns.
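
As a quick reference check, the expected output above can be reproduced with plain NumPy. The loop mirrors what each GPU thread computes for its (row, col) position; the input values are an assumption (a filled with 0–3, b with 0, 2, 4, 6), but they are consistent with the expected HostBuffer and with the tiled example shown later:

import numpy as np

SIZE = 2
a = np.arange(SIZE * SIZE, dtype=np.float32).reshape(SIZE, SIZE)        # [[0, 1], [2, 3]]
b = (2 * np.arange(SIZE * SIZE, dtype=np.float32)).reshape(SIZE, SIZE)  # [[0, 2], [4, 6]]

out = np.zeros((SIZE, SIZE), dtype=np.float32)
for row in range(SIZE):          # each (row, col) pair maps to one GPU thread
    for col in range(SIZE):
        acc = 0.0
        for k in range(SIZE):    # the dot product that thread accumulates
            acc += a[row, k] * b[k, col]
        out[row, col] = acc

print(out.flatten())             # [ 4.  6. 12. 22.]  matches the expected HostBuffer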

Understanding GPU Performance: The Roofline Model

Having implemented the naive matrix multiplication, you might be wondering: How well is our kernel actually performing? Is it limited by the GPU’s computational power, or is something else holding it back?

The roofline model is your compass for GPU optimization—it reveals which hardware bottleneck limits your kernel’s performance and guides you toward the most impactful optimizations. Rather than guessing at improvements, the roofline model shows you exactly where to focus your efforts.

1. Two ceilings for every GPU kernel

Every GPU kernel operates under two fundamental constraints:

  • Compute ceiling – how quickly the cores can execute floating-point operations (peak FLOPs/s)
  • Memory ceiling – how quickly the memory system can feed those cores with data (peak bytes/s)

Understanding which ceiling constrains your kernel is crucial for optimization strategy. The roofline model visualizes this relationship by plotting two key metrics:

X-axis: Arithmetic Intensity – How much computation you extract per byte of data

\[\Large I = \frac{\text{Total FLOPs}}{\text{Total Bytes from Memory}} \quad [\text{FLOP/B}]\]

Y-axis: Sustained Performance – How fast your kernel actually runs

\[\Large P_{\text{sustained}} = \frac{\text{Total FLOPs}}{\text{Elapsed Time}} \quad [\text{GFLOP/s}]\]

Two “roofs” bound all achievable performance:

| Roof | Equation | Meaning |
|------|----------|---------|
| Memory roof | \(P = B_{\text{peak}} \cdot I\) | Sloped line; performance limited by memory bandwidth |
| Compute roof | \(P = P_{\text{peak}}\) | Horizontal line; performance limited by compute throughput |

The critical intensity

\[\Large I^* = \frac{P_{\text{peak}}}{B_{\text{peak}}}\]

marks where a kernel transitions from memory-bound (\(I < I^* \)) to compute-bound (\(I > I^* \)).

2. Hardware example: NVIDIA A100 specifications

Let’s ground this theory in concrete numbers using the NVIDIA A100:

Peak FP32 throughput \[\Large P_{\text{peak}} = 19.5 \text{ TFLOP/s} = 19{,}500 \text{ GFLOP/s}\]

Peak HBM2 bandwidth \[\Large B_{\text{peak}} = 1{,}555 \text{ GB/s}\]

Critical intensity \[\Large I^* = \frac{19{,}500}{1{,}555} \approx 12.5 \text{ FLOP/B}\]

Source: NVIDIA A100 Tensor Core GPU Architecture

This means kernels with arithmetic intensity below 12.5 FLOP/B are memory-bound, while those above are compute-bound.

3. Visualizing our matrix multiplication implementations

The animation below shows how our puzzle implementations map onto the A100’s roofline model:

Roofline Model Visualization

The visualization demonstrates the optimization journey we’ll take in this puzzle:

  1. Hardware constraints – The red memory roof and blue compute roof define performance limits
  2. Our starting point – The naive implementation (left purple dot) sitting firmly on the memory roof
  3. Optimization target – The shared memory version (right purple dot) with improved arithmetic intensity
  4. Ultimate goal – The golden arrow pointing toward the critical intensity where kernels become compute-bound

4. Analyzing our naive implementation

Let’s examine why our naive kernel from the previous section performs as it does. For our \(2 \times 2\) matrix multiplication:

Computation per output element: \(\text{SIZE} + (\text{SIZE}-1) = 3 \text{ FLOPs }\)

Each element requires \(\text{SIZE}\) multiplications and \(\text{SIZE} - 1\) additions: \[C_{00} = A_{00} \cdot B_{00} + A_{01} \cdot B_{10}\] For \(\text{SIZE} = 2\) it is 2 multiplications + 1 addition = 3 FLOPs

Memory accesses per output element:

  • Row from matrix A: \(2 \times 4 = 8\) bytes (FP32)
  • Column from matrix B: \(2 \times 4 = 8\) bytes (FP32)
  • Total: \(16\) bytes per output element

Arithmetic intensity: \[\Large I_{\text{naive}} = \frac{3 \text{ FLOPs}}{16 \text{ bytes}} = 0.1875 \text{ FLOP/B}\]

Since \(I_{\text{naive}} = 0.1875 \ll I^* = 12.5\), our naive kernel is severely memory-bound.

Expected performance: \[\Large P \approx B_{\text{peak}} \times I_{\text{naive}} = 1{,}555 \times 0.1875 \approx 292 \text{ GFLOP/s}\]

This represents only \(\frac{292}{19{,}500} \approx 1.5\%\) of the GPU’s computational potential! The visualization clearly shows this as the leftmost purple dot sitting squarely on the memory roof—we’re nowhere near the compute ceiling.
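
These numbers are easy to verify. The short snippet below simply redoes the arithmetic from this section, using the A100 peak figures and the naive kernel's per-element FLOP and byte counts:

P_peak = 19_500.0   # GFLOP/s (A100 FP32)
B_peak = 1_555.0    # GB/s    (A100 HBM2)

I_star = P_peak / B_peak                 # critical intensity
flops_per_elem = 2 * 2 - 1               # SIZE multiplications + (SIZE - 1) additions = 3
bytes_per_elem = 2 * 2 * 4               # one row of A + one column of B in FP32 = 16
I_naive = flops_per_elem / bytes_per_elem

P_expected = min(P_peak, B_peak * I_naive)
print(I_star)                            # ~12.5 FLOP/B
print(I_naive)                           # 0.1875 FLOP/B -> memory-bound
print(P_expected)                        # ~292 GFLOP/s, roughly 1.5% of peak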

5. The path forward: shared memory optimization

The roofline model reveals our optimization strategy: increase arithmetic intensity by reducing redundant memory accesses. This is exactly what the shared memory approach accomplishes:

Shared memory benefits:

  • Cooperative loading: Threads work together to load matrix blocks into fast shared memory
  • Data reuse: Each loaded element serves multiple computations
  • Reduced global memory traffic: Fewer accesses to slow global memory

Expected arithmetic intensity improvement: \[\Large I_{\text{shared}} = \frac{12 \text{ FLOPs}}{32 \text{ bytes}} = 0.375 \text{ FLOP/B}\]

While still memory-bound for our small \(2 \times 2\) case, this 2× improvement in arithmetic intensity scales dramatically for larger matrices where shared memory tiles can be reused many more times.

6. Optimization strategies revealed by the roofline

The roofline model not only diagnoses current performance but also illuminates optimization paths. Here are the key techniques we’ll explore in later puzzles:

| Technique | Roofline effect | Implementation approach |
|-----------|-----------------|-------------------------|
| Shared memory tiling | ↑ Arithmetic intensity through data reuse | Cooperative loading, block-wise computation |
| Register blocking | Reduce memory traffic with register accumulation | Loop unrolling with register variables |
| Kernel fusion | More FLOPs per byte by combining operations | Single kernel handling multiple computation stages |
| Memory coalescing | Maximize effective bandwidth utilization | Structured access patterns, proper thread organization |
| Mixed precision | Smaller data types reduce memory pressure | FP16/BF16 input with FP32 accumulation |

Each technique moves kernels along the roofline—either up the memory roof (better bandwidth utilization) or rightward toward the compute roof (higher arithmetic intensity).

7. Beyond simple rooflines

Multi-level memory: Advanced rooflines include separate ceilings for L2 cache, shared memory, and register bandwidth to identify which memory hierarchy level constrains performance.

Communication rooflines: For multi-GPU applications, replace memory bandwidth with interconnect bandwidth (NVLink, InfiniBand) to analyze scaling efficiency.

Specialized units: Modern GPUs include tensor cores with their own performance characteristics, requiring specialized roofline analysis.

8. Using the roofline in practice

  1. Profile your kernel: Use tools like Nsight Compute to measure actual FLOPs and memory traffic
  2. Plot the data point: Calculate arithmetic intensity and sustained performance
  3. Identify the bottleneck: Memory-bound kernels sit on the memory roof; compute-bound kernels approach the compute roof
  4. Choose optimizations: Focus on bandwidth improvements for memory-bound kernels, algorithmic changes for compute-bound ones
  5. Measure and iterate: Verify that optimizations move kernels in the expected direction
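
In code, steps 2 and 3 boil down to a few lines: given measured FLOPs, bytes moved, and elapsed time, compute the intensity and compare it against the roofline. A minimal sketch follows; the peak numbers are the A100 values used above, and the example measurements are made-up placeholders you would replace with profiler output:

def roofline_point(flops, bytes_moved, seconds,
                   p_peak_gflops=19_500.0, b_peak_gbs=1_555.0):
    # Classify a kernel as memory- or compute-bound from profiler measurements.
    intensity = flops / bytes_moved                    # FLOP/B
    sustained = flops / seconds / 1e9                  # GFLOP/s
    critical = p_peak_gflops / b_peak_gbs              # I* for this hardware
    bound = p_peak_gflops if intensity > critical else b_peak_gbs * intensity
    kind = "compute-bound" if intensity > critical else "memory-bound"
    return intensity, sustained, bound, kind

# Example with made-up profiler numbers:
print(roofline_point(flops=1e9, bytes_moved=4e9, seconds=5e-3))
# (0.25, 200.0, 388.75, 'memory-bound')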

Connection to our shared memory puzzle

In the next section, we’ll implement the shared memory optimization that begins moving our kernel up the roofline. As the visualization shows, this takes us from the left purple dot (naive) to the right purple dot (shared memory)—a clear performance improvement through better data reuse.

While our \(2 \times 2\) example won’t reach the compute roof, you’ll see how the same principles scale to larger matrices where shared memory becomes crucial for performance. The roofline model provides the theoretical foundation for understanding why shared memory helps and how much improvement to expect.

Understanding the roofline model transforms GPU optimization from guesswork into systematic engineering. Every optimization technique in this book can be understood through its effect on this simple but powerful performance model.

Shared Memory Matrix Multiplication

Overview

Implement a kernel that multiplies square matrices \(A\) and \(B\) and stores the result in \(\text{out}\), using shared memory to improve memory access efficiency. This version loads matrix blocks into shared memory before computation.

Key concepts

In this puzzle, you’ll learn about:

  • Block-local memory management with LayoutTensor
  • Thread synchronization patterns
  • Memory access optimization using shared memory
  • Collaborative data loading with 2D indexing
  • Efficient use of LayoutTensor for matrix operations

The key insight is understanding how to use fast shared memory with LayoutTensor to reduce expensive global memory operations.

Configuration

  • Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(1 \times 1\)

Layout configuration:

  • Input A: Layout.row_major(SIZE, SIZE)
  • Input B: Layout.row_major(SIZE, SIZE)
  • Output: Layout.row_major(SIZE, SIZE)
  • Shared Memory: Two TPB × TPB LayoutTensors

Memory organization:

Global Memory (LayoutTensor):          Shared Memory (LayoutTensor):
A[i,j]: Direct access                  a_shared[local_row, local_col]
B[i,j]: Direct access                  b_shared[local_row, local_col]

Code to complete

fn single_block_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    local_row = thread_idx.y
    local_col = thread_idx.x
    # FILL ME IN (roughly 12 lines)


View full file: problems/p14/p14.mojo

Tips
  1. Load matrices to shared memory using global and local indices
  2. Call barrier() after loading
  3. Compute dot product using shared memory indices
  4. Check array bounds for all operations

Running the code

To test your solution, run the following command in your terminal:

uv run poe p14 --single-block
pixi run p14 --single-block

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])

Solution

fn single_block_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    local_row = thread_idx.y
    local_col = thread_idx.x

    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()

    if row < size and col < size:
        a_shared[local_row, local_col] = a[row, col]
        b_shared[local_row, local_col] = b[row, col]

    barrier()

    if row < size and col < size:
        var acc: out.element_type = 0

        @parameter
        for k in range(size):
            acc += a_shared[local_row, k] * b_shared[k, local_col]

        out[row, col] = acc


The shared memory implementation with LayoutTensor improves performance through efficient memory access patterns:

Memory organization

Input Tensors (2×2):                Shared Memory (3×3):
Matrix A:                           a_shared:
 [a[0,0] a[0,1]]                     [s[0,0] s[0,1] s[0,2]]
 [a[1,0] a[1,1]]                     [s[1,0] s[1,1] s[1,2]]
                                     [s[2,0] s[2,1] s[2,2]]
Matrix B:                           b_shared: (similar layout)
 [b[0,0] b[0,1]]                     [t[0,0] t[0,1] t[0,2]]
 [b[1,0] b[1,1]]                     [t[1,0] t[1,1] t[1,2]]
                                     [t[2,0] t[2,1] t[2,2]]

Implementation Phases:

  1. Shared Memory Setup:

    # Create 2D shared memory tensors using TensorBuilder
    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    
  2. Thread Indexing:

    # Global indices for matrix access
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    
    # Local indices for shared memory
    local_row = thread_idx.y
    local_col = thread_idx.x
    
  3. Data Loading:

    # Load data into shared memory using LayoutTensor indexing
    if row < size and col < size:
        a_shared[local_row, local_col] = a[row, col]
        b_shared[local_row, local_col] = b[row, col]
    
  4. Computation with Shared Memory:

    # Guard ensures we only compute for valid matrix elements
    if row < size and col < size:
        # Initialize accumulator with output tensor's type
        var acc: out.element_type = 0
    
        # Compile-time unrolled loop for matrix multiplication
        @parameter
        for k in range(size):
            acc += a_shared[local_row, k] * b_shared[k, local_col]
    
        # Write result only for threads within matrix bounds
        out[row, col] = acc
    

    Key aspects:

    • Boundary check: if row < size and col < size

      • Prevents out-of-bounds computation
      • Only valid threads perform work
      • Essential because TPB (3×3) > SIZE (2×2)
    • Accumulator Type: var acc: out.element_type

      • Uses output tensor’s element type for type safety
      • Ensures consistent numeric precision
      • Initialized to zero before accumulation
    • Loop Optimization: @parameter for k in range(size)

      • Unrolls the loop at compile time
      • Enables better instruction scheduling
      • Efficient for small, known matrix sizes
    • Result Writing: out[row, col] = acc

      • Protected by the same guard condition
      • Only valid threads write results
      • Maintains matrix bounds safety

Thread safety and synchronization:

  1. Guard conditions:

    • Input Loading: if row < size and col < size
    • Computation: Same guard ensures thread safety
    • Output Writing: Protected by the same condition
    • Prevents invalid memory access and race conditions
  2. Memory access safety:

    • Shared memory: Accessed only within TPB bounds
    • Global memory: Protected by size checks
    • Output: Guarded writes prevent corruption

Key language features:

  1. LayoutTensor benefits:

    • Direct 2D indexing simplifies code
    • Type safety through element_type
    • Efficient memory layout handling
  2. Shared memory allocation:

    • TensorBuilder for structured allocation
    • Row-major layout matching input tensors
    • Proper alignment for efficient access
  3. Synchronization:

    • barrier() ensures shared memory consistency
    • Proper synchronization between load and compute
    • Thread cooperation within block

Performance optimizations:

  1. Memory Access Efficiency:

    • Single global memory load per element
    • Multiple reuse through shared memory
    • Coalesced memory access patterns
  2. Thread cooperation:

    • Collaborative data loading
    • Shared data reuse
    • Efficient thread synchronization
  3. Computational benefits:

    • Reduced global memory traffic
    • Better cache utilization
    • Improved instruction throughput

This implementation significantly improves performance over the naive version by:

  • Reducing global memory accesses
  • Enabling data reuse through shared memory
  • Using efficient 2D indexing with LayoutTensor
  • Maintaining proper thread synchronization
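
A back-of-the-envelope comparison of global memory traffic shows where the improvement comes from. In the naive kernel every thread reads a full row of A and a full column of B, while in the shared-memory version every thread loads a single element of each matrix; the sketch below just restates the counts already discussed above for the 2×2 case:

SIZE = 2            # matrix dimension
BYTES = 4           # FP32

threads = SIZE * SIZE
naive_loads = threads * 2 * SIZE       # each thread reads a row of A and a column of B
shared_loads = threads * 2             # each thread loads one element of A and one of B

print(naive_loads * BYTES)     # 64 bytes of global traffic for the naive 2x2 kernel
print(shared_loads * BYTES)    # 32 bytes -> the 2x intensity gain seen in the roofline section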

Tiled Matrix Multiplication

Overview

Implement a kernel that multiplies square matrices \(A\) and \(B\) using tiled matrix multiplication with LayoutTensor. This approach handles large matrices by processing them in smaller chunks (tiles).

Key concepts

  • Matrix tiling with LayoutTensor for efficient computation
  • Multi-block coordination with proper layouts
  • Efficient shared memory usage through TensorBuilder
  • Boundary handling for tiles with LayoutTensor indexing

Configuration

  • Matrix size: \(\text{SIZE_TILED} = 8\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(3 \times 3\) blocks
  • Shared memory: Two \(\text{TPB} \times \text{TPB}\) LayoutTensors per block

Layout configuration:

  • Input A: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • Input B: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • Output: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • Shared Memory: Two TPB × TPB LayoutTensors using TensorBuilder

Tiling strategy

Block organization

Grid Layout (3×3):           Thread Layout per Block (3×3):
[B00][B01][B02]               [T00 T01 T02]
[B10][B11][B12]               [T10 T11 T12]
[B20][B21][B22]               [T20 T21 T22]

Each block processes a tile using LayoutTensor indexing

Tile processing steps

  1. Calculate global and local indices for thread position
  2. Allocate shared memory for A and B tiles
  3. For each tile:
    • Reset shared memory
    • Load tile from matrix A and B
    • Compute partial products
    • Accumulate results in registers
  4. Write final accumulated result

Memory access pattern

Matrix A (8×8)                 Matrix B (8×8)               Matrix C (8×8)
+---+---+---+                  +---+---+---+                +---+---+---+
|T00|T01|T02| ...              |T00|T01|T02| ...            |T00|T01|T02| ...
+---+---+---+                  +---+---+---+                +---+---+---+
|T10|T11|T12|                  |T10|T11|T12|                |T10|T11|T12|
+---+---+---+                  +---+---+---+                +---+---+---+
|T20|T21|T22|                  |T20|T21|T22|                |T20|T21|T22|
+---+---+---+                  +---+---+---+                +---+---+---+
  ...                            ...                          ...

Tile Processing (for computing C[T11]):
1. Load tiles from A and B:
   +---+      +---+
   |A11| ×    |B11|     For each phase k:
   +---+      +---+     C[T11] += A[row, k] × B[k, col]

2. Tile movement:
   Phase 1     Phase 2     Phase 3
   A: [T10]    A: [T11]    A: [T12]
   B: [T01]    B: [T11]    B: [T21]

3. Each thread (i,j) in tile computes:
   C[i,j] = Σ (A[i,k] × B[k,j]) for k in tile width

Synchronization required:
* After loading tiles to shared memory
* After computing each phase

Code to complete

alias SIZE_TILED = 8
alias BLOCKS_PER_GRID_TILED = (3, 3)  # each block covers 3x3 elements
alias THREADS_PER_BLOCK_TILED = (TPB, TPB)
alias layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)


fn matmul_tiled[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    local_row = thread_idx.x
    local_col = thread_idx.y
    global_row = block_idx.x * TPB + local_row
    global_col = block_idx.y * TPB + local_col
    # FILL ME IN (roughly 20 lines)


View full file: problems/p14/p14.mojo

Tips
  1. Calculate global thread positions from block and thread indices correctly
  2. Clear shared memory before loading new tiles
  3. Load tiles with proper bounds checking
  4. Accumulate results across tiles with proper synchronization

Running the code

To test your solution, run the following command in your terminal:

uv run poe p14 --tiled
pixi run p14 --tiled

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([2240.0, 2296.0, 2352.0, 2408.0, 2464.0, 2520.0, 2576.0, 2632.0, 5824.0, 6008.0, 6192.0, 6376.0, 6560.0, 6744.0, 6928.0, 7112.0, 9408.0, 9720.0, 10032.0, 10344.0, 10656.0, 10968.0, 11280.0, 11592.0, 12992.0, 13432.0, 13872.0, 14312.0, 14752.0, 15192.0, 15632.0, 16072.0, 16576.0, 17144.0, 17712.0, 18280.0, 18848.0, 19416.0, 19984.0, 20552.0, 20160.0, 20856.0, 21552.0, 22248.0, 22944.0, 23640.0, 24336.0, 25032.0, 23744.0, 24568.0, 25392.0, 26216.0, 27040.0, 27864.0, 28688.0, 29512.0, 27328.0, 28280.0, 29232.0, 30184.0, 31136.0, 32088.0, 33040.0, 33992.0])

Solution: Manual tiling

fn matmul_tiled[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    tiled_col = block_idx.x * TPB + thread_idx.x
    tiled_row = block_idx.y * TPB + thread_idx.y
    local_col = thread_idx.x
    local_row = thread_idx.y

    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()

    var acc: out.element_type = 0

    # Iterate over tiles to compute matrix product
    @parameter
    for tile in range((size + TPB - 1) // TPB):
        # Reset shared memory tiles
        if local_row < TPB and local_col < TPB:
            a_shared[local_row, local_col] = 0
            b_shared[local_row, local_col] = 0

        barrier()

        # Load A tile - global row stays the same, col determined by tile
        if tiled_row < size and (tile * TPB + local_col) < size:
            a_shared[local_row, local_col] = a[
                tiled_row, tile * TPB + local_col
            ]

        # Load B tile - row determined by tile, global col stays the same
        if (tile * TPB + local_row) < size and tiled_col < size:
            b_shared[local_row, local_col] = b[
                tile * TPB + local_row, tiled_col
            ]

        barrier()

        # Matrix multiplication within the tile
        if tiled_row < size and tiled_col < size:

            @parameter
            for k in range(min(TPB, size - tile * TPB)):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write out final result
    if tiled_row < size and tiled_col < size:
        out[tiled_row, tiled_col] = acc


The tiled matrix multiplication implementation demonstrates efficient handling of large matrices \((8 \times 8)\) using small tiles \((3 \times 3)\). Here’s how it works:

  1. Thread indexing setup

    • Global position calculation:
      tiled_row = block_idx.y * TPB + thread_idx.y
      tiled_col = block_idx.x * TPB + thread_idx.x
      
    • Local position in tile:
      local_row = thread_idx.y
      local_col = thread_idx.x
      
  2. Shared memory allocation

    Input matrices (8×8):
    A = [0  1  2  3  4  5  6  7 ]    B = [0  2  4  6  8  10 12 14]
        [8  9  10 11 12 13 14 15]        [16 18 20 22 24 26 28 30]
        [16 17 18 19 20 21 22 23]        [32 34 36 38 40 42 44 46]
        [24 25 26 27 28 29 30 31]        [48 50 52 54 56 58 60 62]
        [32 33 34 35 36 37 38 39]        [64 66 68 70 72 74 76 78]
        [40 41 42 43 44 45 46 47]        [80 82 84 86 88 90 92 94]
        [48 49 50 51 52 53 54 55]        [96 98 100 102 104 106 108 110]
        [56 57 58 59 60 61 62 63]        [112 114 116 118 120 122 124 126]
    
    Shared memory per block (3×3):
    a_shared[TPB, TPB]  b_shared[TPB, TPB]
    
  3. Tile processing loop

    Number of tiles = (8 + 3 - 1) // 3 = 3 tiles
    
    For each tile:
    1. Reset shared memory
    2. Load tile from A and B
    3. Compute partial products
    4. Accumulate in register
    
  4. Memory loading pattern

    • Loading A tile:
      if tiled_row < size and (tile * TPB + local_col) < size:
          a_shared[local_row, local_col] = a[tiled_row, tile * TPB + local_col]
      
    • Loading B tile:
      if (tile * TPB + local_row) < size and tiled_col < size:
          b_shared[local_row, local_col] = b[tile * TPB + local_row, tiled_col]
      
  5. Computation within tile

    For k in range(min(TPB, size - tile * TPB)):
        acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    • Maximizes memory coalescing:
      Coalesced Access (Good):          Non-Coalesced Access (Bad):
      Thread0: [M0][M1][M2][M3]         Thread0: [M0][ ][ ][ ]
      Thread1: [M4][M5][M6][M7]    vs   Thread1: [ ][M1][ ][ ]
      Thread2: [M8][M9][MA][MB]         Thread2: [ ][ ][M2][ ]
      Thread3: [MC][MD][ME][MF]         Thread3: [ ][ ][ ][M3]
      ↓                                 ↓
      1 memory transaction              4 memory transactions
      
      When threads access consecutive memory locations (left), the GPU can combine multiple reads into a single transaction. When threads access scattered locations (right), each access requires a separate transaction, reducing performance.
  6. Synchronization points

    barrier() after:
    1. Shared memory reset
    2. Tile loading
    3. Tile computation
    

Key performance features:

  • Processes 8×8 matrix using 3×3 tiles
  • Uses shared memory for fast tile access
  • Minimizes global memory transactions
  • Handles matrix boundaries correctly
  • Maintains coalesced memory access
  1. Boundary handling:
    if tiled_row < size and tiled_col < size:
        out[tiled_row, tiled_col] = acc
    
    • Prevents out-of-bounds access
    • Handles matrix edges
    • Safe result writing

Key optimizations

  1. Layout optimization:

    • Row-major layout for all tensors
    • Efficient 2D indexing
  2. Memory access:

    • Coalesced global memory loads
    • Efficient shared memory usage
  3. Computation:

    • Register-based accumulation i.e. var acc: out.element_type = 0
    • Compile-time loop unrolling via @parameter

This implementation achieves high performance through:

  • Efficient use of LayoutTensor for memory access
  • Optimal tiling strategy
  • Proper thread synchronization
  • Careful boundary handling
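
The tiling loop is easier to see on the host. This NumPy sketch walks the same 3×3 tiles over the 8×8 matrices shown in step 2 above and checks the blocked result against a plain matrix product; it models only the algorithm, not shared memory or threads:

import numpy as np

SIZE, TPB = 8, 3
A = np.arange(SIZE * SIZE, dtype=np.float32).reshape(SIZE, SIZE)
B = 2 * np.arange(SIZE * SIZE, dtype=np.float32).reshape(SIZE, SIZE)

C = np.zeros((SIZE, SIZE), dtype=np.float32)
num_tiles = (SIZE + TPB - 1) // TPB            # ceiling division -> 3 tiles per dimension
for bi in range(num_tiles):                    # block row (block_idx.y)
    for bj in range(num_tiles):                # block column (block_idx.x)
        for t in range(num_tiles):             # the tile loop inside the kernel
            a_tile = A[bi * TPB:(bi + 1) * TPB, t * TPB:(t + 1) * TPB]
            b_tile = B[t * TPB:(t + 1) * TPB, bj * TPB:(bj + 1) * TPB]
            C[bi * TPB:(bi + 1) * TPB, bj * TPB:(bj + 1) * TPB] += a_tile @ b_tile

print(np.allclose(C, A @ B))    # True
print(C[0, 0], C[-1, -1])       # 2240.0 33992.0 -- matches the expected buffer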

Solution: Idiomatic LayoutTensor tiling

from gpu.memory import async_copy_wait_all
from layout.layout_tensor import copy_dram_to_sram_async


fn matmul_idiomatic_tiled[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    # Get the tile of the output matrix `out` that this thread block is responsible for
    out_tile = out.tile[TPB, TPB](block_idx.y, block_idx.x)
    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    local_row = thread_idx.y
    local_col = thread_idx.x

    var acc: out.element_type = 0

    alias load_a_layout = Layout.row_major(1, TPB)
    alias load_b_layout = Layout.row_major(TPB, 1)
    for idx in range((size + TPB - 1) // TPB):
        a_tile = a.tile[TPB, TPB](block_idx.y, idx)
        b_tile = b.tile[TPB, TPB](idx, block_idx.x)

        copy_dram_to_sram_async[thread_layout=load_a_layout](a_shared, a_tile)
        copy_dram_to_sram_async[thread_layout=load_b_layout](b_shared, b_tile)

        async_copy_wait_all()

        barrier()

        @parameter
        for k in range(TPB):
            acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    if (
        block_idx.y * TPB + local_row < size
        and block_idx.x * TPB + local_col < size
    ):
        out_tile[local_row, local_col] = acc


The idiomatic tiled matrix multiplication leverages Mojo’s LayoutTensor API and asynchronous memory operations for a cleaner implementation:

  1. LayoutTensor tile API

    out_tile = out.tile[TPB, TPB](block_idx.y, block_idx.x)
    a_tile = a.tile[TPB, TPB](block_idx.y, idx)
    b_tile = b.tile[TPB, TPB](idx, block_idx.x)
    

    This directly expresses “get the tile at position (block_idx.y, block_idx.x)” without manual coordinate calculation. See the documentation for more details.

  2. Asynchronous memory operations

    copy_dram_to_sram_async[thread_layout=load_a_layout](a_shared, a_tile)
    copy_dram_to_sram_async[thread_layout=load_b_layout](b_shared, b_tile)
    async_copy_wait_all()
    

    These operations:

    • Launch asynchronous memory transfers that may overlap with computation via copy_dram_to_sram_async
    • Use specialized thread layouts for optimal memory access patterns
    • Eliminate the need for manual memory initialization
  3. Specialized compile-time load layouts

    alias load_a_layout = Layout.row_major(1, TPB)
    alias load_b_layout = Layout.row_major(TPB, 1)
    

    These layouts optimize how threads cooperate during memory transfers:

    • load_a_layout: Each thread loads a slice of a row (coalesced access)
    • load_b_layout: Each thread loads a slice of a column (transposed access)
  4. Efficient thread synchronization

    # Wait for async operations to complete
    async_copy_wait_all()
    # Ensure all threads can see the shared memory contents
    barrier()
    

    The barriers ensure proper synchronization:

    • After memory transfers complete
    • After computation for each tile
  5. Proper boundary handling

    if block_idx.y * TPB + local_row < size and block_idx.x * TPB + local_col < size:
        out_tile[local_row, local_col] = acc
    

    This critical check prevents out-of-bounds writes for blocks at the matrix boundaries.

  6. Tile processing loop

    for idx in range((size + TPB - 1) // TPB):
        # Process one tile
    

    Uses ceiling division to handle matrices whose dimensions aren’t perfect multiples of the tile size.

Performance considerations

The idiomatic implementation maintains the performance benefits of tiling while providing cleaner abstractions:

  1. Memory locality: Exploits spatial and temporal locality through tiling
  2. Coalesced access: Specialized load layouts ensure coalesced memory access patterns
  3. Compute-memory overlap: Potential overlap through asynchronous memory operations
  4. Shared memory efficiency: No redundant initialization of shared memory
  5. Register pressure: Uses accumulation registers for optimal compute throughput

This implementation shows how high-level abstractions can express complex GPU algorithms without sacrificing performance. It’s a prime example of Mojo’s philosophy: combining high-level expressiveness with low-level performance control.

Key differences from manual tiling

| Feature | Manual Tiling | Idiomatic Tiling |
|---------|---------------|------------------|
| Memory access | Direct indexing with bounds checks | LayoutTensor tile API |
| Tile loading | Explicit element-by-element copying | Asynchronous bulk transfers |
| Shared memory | Manual initialization (zeroing) | Managed by copy functions |
| Code complexity | More verbose with explicit indexing | More concise with higher-level APIs |
| Bounds checking | Multiple checks during loading and computing | Single check at final write |

The idiomatic approach is not just cleaner but also potentially more performant due to the use of specialized memory layouts and asynchronous operations.

Puzzle 15: 1D Convolution Op

Bridging to Python with MAX Graph

We’re now entering Part III of our GPU puzzle journey: Interfacing with Python via MAX Graph Custom Ops.

In previous puzzles, we’ve learned how to write efficient GPU kernels in Mojo. Now we’ll explore how to:

  • Package these kernels as custom operations that can be called from Python
  • Integrate with the MAX Graph system for accelerated machine learning
  • Bridge the gap between high-level Python APIs and low-level GPU code

This integration allows us to leverage the performance of Mojo GPU kernels while working in familiar Python environments.

Overview

In Puzzle 11, we implemented a 1D convolution kernel that runs efficiently on the GPU. Now we’ll take this kernel and transform it into a custom operation that can be called directly from Python using MAX Graph.

The 1D convolution kernel we’ll be working with is already implemented:

alias TPB = 15
alias BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    input: LayoutTensor[mut=True, dtype, in_layout],
    kernel: LayoutTensor[mut=True, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + conv_size - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[conv_size]().shared().alloc()
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]

    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
        var local_sum: out.element_type = 0

        @parameter
        for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The key aspects of this puzzle include:

  1. Custom op registration: Understanding how to expose Mojo functions to Python via the @compiler.register decorator
  2. Packaging custom ops: Learning how to package Mojo code for use with MAX Graph
  3. Python integration: Calling custom operations from Python through MAX Graph
  4. Cross-language data flow: Managing data types and memory between Python and GPU

This custom operation will:

  • Accept NumPy arrays as input from Python
  • Transfer this data to the GPU
  • Execute our optimized convolution kernel
  • Return the results back to Python

When you complete this puzzle, you’ll have created a seamless bridge between Python’s rich ecosystem and Mojo’s powerful GPU performance.

Code to complete

To complete this puzzle, you only need to fill in a single call to conv1d_kernel:

import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        out: OutputTensor[rank=1],
        input: InputTensor[type = out.type, rank = out.rank],
        kernel: InputTensor[type = out.type, rank = out.rank],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        out_tensor = out.to_layout_tensor()
        input_tensor = input.to_layout_tensor()
        kernel_tensor = kernel.to_layout_tensor()
        alias in_layout = input_tensor.layout
        alias out_layout = out_tensor.layout
        alias conv_layout = kernel_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()
            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[out.type](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[out.type]]](out_tensor.ptr),
                    input_size,
                    owning=False,
                ),
                0,
            )

            # FILL ME IN with 1 line calling our conv1d_kernel

        elif target == "cpu":
            # we can fallback to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)


View full file: problems/p15/op/conv1d.mojo

You can run the puzzle with:

uv run poe p15
pixi run p15

When successful, you should see output similar to:

Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Verification passed: Custom kernel results match NumPy calculation

This indicates that your custom MAX Graph operation correctly implements the 1D convolution algorithm.
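
The “Expected result (NumPy calculation)” line can be reproduced with a short reference implementation of the same zero-padded 1D convolution. This is a host-side sketch, assuming the input 0–14 and kernel [0, 1, 2, 3] shown in the sample output:

import numpy as np

input_arr = np.arange(15, dtype=np.float32)          # [0, 1, ..., 14]
conv_kernel = np.arange(4, dtype=np.float32)         # [0, 1, 2, 3]

padded = np.concatenate([input_arr, np.zeros(len(conv_kernel) - 1, dtype=np.float32)])
expected = np.array([np.dot(padded[i:i + len(conv_kernel)], conv_kernel)
                     for i in range(len(input_arr))])

print(expected)
# [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]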

Solution

To solve this puzzle, we need to integrate our 1D convolution kernel with the MAX Graph system. The key is to properly call our kernel from the execute method in the Conv1DCustomOp struct.

The solution is:

            gpu_ctx.enqueue_function[
                conv1d_kernel[
                    in_layout, out_layout, conv_layout, input_size, conv_size
                ]
            ](
                out_tensor,
                input_tensor,
                kernel_tensor,
                grid_dim=BLOCKS_PER_GRID,
                block_dim=(TPB, 1),
            )

This single call does several important things:

  1. Calls enqueue_function on the GPU context (gpu_ctx is of type DeviceContext) to schedule our kernel execution
  2. Passes the necessary layout and size information as compile-time parameters
  3. Provides the output, input, and kernel tensors as runtime arguments
  4. Configures the execution grid with the appropriate dimensions

Let’s break down how this works in the larger context:

Python-Mojo integration flow

  1. Python side (problems/p15/p15.py):

    • Creates NumPy arrays for input and kernel
    • Calls conv_1d() function which wraps our operation in MAX Graph
    • Converts NumPy arrays to MAX driver Tensors with Tensor.from_numpy(input).to(device)
    • Loads the custom operation package with custom_extensions=[mojo_kernels]
  2. Graph building:

    • Builds a MAX graph that contains our custom "conv1d" operation, with the input and kernel tensors as graph inputs
    • Supplies input_size and conv_size as compile-time parameters when the op is added to the graph

  3. Custom op registration:

    • The @compiler.register("conv1d") decorator exposes our operation to MAX Graph. See @compiler.register
    • The execute method parameters define the interface (inputs, outputs, context)
    • Input/output tensors are converted to LayoutTensors for use in our kernel
    • Device context manages GPU memory allocation and kernel execution
  4. Kernel execution:

    • When model.execute(…) is called, our conv1d_kernel receives the data
    • GPU thread configuration is set with grid_dim and block_dim
    • Results are transferred back to CPU with result.to(CPU())
    • NumPy verification compares our results with the expected output

Key Components in Detail

  1. Custom Op Structure:

    @compiler.register("conv1d")
    struct Conv1DCustomOp:
        @staticmethod
        fn execute[target: StaticString, input_size: Int, conv_size: Int, dtype: DType = DType.float32](
            out: OutputTensor[rank=1],
            input: InputTensor[type = out.type, rank = out.rank],
            kernel: InputTensor[type = out.type, rank = out.rank],
            ctx: DeviceContextPtr,
        ) raises:
            # Implementation
    
    • target indicates the device type (“gpu” or “cpu”)
    • input_size and conv_size are parameters passed from Python
    • Tensor types ensure correct shape and type checking
    • The function is marked raises for proper error handling
  2. Tensor Conversion:

    out_tensor = out.to_layout_tensor()
    input_tensor = input.to_layout_tensor()
    kernel_tensor = kernel.to_layout_tensor()
    
    • MAX Graph tensors are converted to Mojo LayoutTensors
    • This allows our kernel to work with them directly
    • The layouts are extracted for compile-time optimization
  3. Device Context Usage:

    gpu_ctx = ctx.get_device_context()
    gpu_ctx.enqueue_memset(...)  # Zero output buffer
    gpu_ctx.enqueue_function[...](...) # Schedule kernel
    
    • Device context manages GPU resources
    • Memory operations ensure correct buffer state
    • Function enqueueing schedules our kernel for execution

This solution demonstrates the complete flow from Python data through MAX Graph to GPU execution and back, leveraging Mojo’s powerful type system and parametric functions to create efficient, type-safe, accelerated operations.

Understanding MAX Graph custom ops

Check out the following tutorials for more details:

Custom op registration

The core of creating a custom operation is the @compiler.register decorator and the associated structure:

@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[...](
        out: OutputTensor[rank=1],
        input: InputTensor[type = out.type, rank = out.rank],
        kernel: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # Implementation here

Key components of the registration:

  • The name passed to the decorator ("conv1d") is what Python code will use to call this operation
  • The struct must have an execute method with the correct signature
  • OutputTensor and InputTensor types define the interface for Python data
  • DeviceContextPtr provides access to the execution environment

Packaging custom ops

Before the custom operation can be used from Python, it needs to be packaged:

mojo package op -o op.mojopkg

This command:

  1. Compiles the Mojo code into a deployable package
  2. Creates the necessary metadata for MAX Graph to understand the operation
  3. Produces a binary artifact (op.mojopkg) that can be loaded by Python

The package must be placed in a location where MAX Graph can find it, typically in a directory accessible to the Python code.

Python integration

On the Python side, here’s how the custom operation is used:

# Path to the directory containing our Mojo operations
mojo_kernels = Path(__file__).parent / "op"

# Configure our graph with the custom conv1d operation
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # Load our custom op package
) as graph:
    # Define inputs to the graph
    input_value, kernel_value = graph.inputs

    # Use our custom operation by name
    output = ops.custom(
        name="conv1d",  # Must match the name in @compiler.register
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor

The key elements are:

  1. Specifying the path to our custom operations with custom_extensions
  2. Calling ops.custom with the registered operation name
  3. Passing input values and parameters that match our operation’s signature
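
After the graph is built, the driver in problems/p15/p15.py compiles and runs it. The exact code lives in that file; the following is only a rough sketch of the remaining steps described earlier (the session.load call and the execute signature here are assumptions for illustration, with session, input_tensor, and kernel_tensor assumed to already exist):

# Sketch only -- mirrors the "Kernel execution" steps listed earlier
model = session.load(graph)                              # compile the graph, including our custom op
result = model.execute(input_tensor, kernel_tensor)[0]   # run on the selected device
print(result.to(CPU()).to_numpy())                       # copy back to host for NumPy verification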

Puzzle 16: Softmax Op

Overview

In this puzzle, we’ll implement the softmax function as a custom MAX Graph operation. Softmax takes a vector of real numbers and normalizes it into a probability distribution.

Mathematically, the softmax function is defined as:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

Where:

  • \(x_i\) is the \(i\)-th element of the input vector
  • \(n\) is the length of the input vector

However, this direct implementation can lead to numerical overflow issues when values are large. To address this, we use a more numerically stable version:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$
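
To make the stability trick concrete, here is a minimal NumPy sketch of the stable formulation (a reference only, not part of the puzzle files; it should agree with scipy.special.softmax):

import numpy as np

def stable_softmax(x):
    # Subtracting the max makes every exponent <= 0, so exp() cannot overflow;
    # the common factor e^{-max(x)} cancels between numerator and denominator.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)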

Our GPU implementation uses parallel reduction for both finding the maximum value and computing the sum of exponentials, making it highly efficient for large vectors.

Key concepts

  • Parallel reduction for efficient maximum and sum calculations
  • Numerical stability through max-subtraction technique
  • Shared memory usage for thread communication
  • Custom MAX Graph operation integration with Python
  • Thread synchronization with barriers

Configuration

  • Vector size: \(\text{SIZE} = 128\)
  • Threads per block: \(\text{TPB} = 128\)
  • Grid dimensions: \(1 \times 1\) block
  • Shared memory: Two shared variables for max and sum

Layout configuration:

  • Input tensor: Layout.row_major(SIZE)
  • Output tensor: Layout.row_major(SIZE)
  • Custom op parameters: {"input_size": input_tensor.shape[0]}

Key aspects of this puzzle include:

  1. Numerical stability: Understanding how to handle potential numerical issues
  2. Parallel reductions: Using shared memory for efficient max and sum calculations
  3. Custom op integration: Completing the Python interface for our Mojo GPU kernel
  4. Testing and verification: Ensuring our implementation matches the expected results

Our softmax custom operation will:

  • Accept NumPy arrays from Python
  • Process them efficiently on the GPU
  • Return normalized probability distributions
  • Match the results of SciPy’s softmax implementation

Code to complete

To complete this puzzle, you need to implement both the GPU and CPU kernels in the Mojo file and complete the graph definition in the Python code.

1. Implement the GPU kernel:

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
from math import exp
from utils.numerics import max_finite, min_finite


alias SIZE = 128
alias TPB = 128
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias layout = Layout.row_major(SIZE)
alias dtype = DType.float32


fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    # FILL IN (roughly 31 lines)
    ...


View full file: problems/p16/op/softmax.mojo

Tips
  1. Use shared memory for both the maximum value and sum to ensure all threads can access these values
  2. Remember to call barrier() at appropriate points to synchronize threads
  3. Implement parallel reduction by having each thread process a portion of the input array
  4. Use a tree-based reduction pattern to minimize thread divergence
  5. Handle out-of-bounds access carefully, especially for large inputs
  6. For numerical stability, calculate \(e^{x_i - max}\) instead of \(e^{x_i}\)

2. Implement the CPU kernel:

fn softmax_cpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[dtype, layout, MutableAnyOrigin],
    input: LayoutTensor[dtype, layout, MutableAnyOrigin],
):
    # FILL IN (roughly 10 lines)
    ...


View full file: problems/p16/op/softmax.mojo

Tips
  1. Create a sequential implementation that follows the same mathematical steps as the GPU version
  2. First find the maximum value across all inputs
  3. Then compute \(e^{x_i - max}\) for each element and accumulate the sum
  4. Finally, normalize by dividing each element by the sum
  5. Use scalar operations since we don’t have parallel threads in the CPU implementation

Test the CPU and GPU kernels

uv run poe p16-test-kernels
pixi run p16-test-kernels

When done correctly, you’ll see:

Total Discovered Tests: 1

Passed : 1 (100.00%)
Failed : 0 (0.00%)
Skipped: 0 (0.00%)

3. Complete the graph definition:

from pathlib import Path
import numpy as np
from max.driver import CPU, Accelerator, Device, Tensor, accelerator_count
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import DeviceRef, Graph, TensorType, ops
from numpy.typing import NDArray
from scipy.special import softmax as scipy_softmax


def softmax(
    input: NDArray[np.float32],
    session: InferenceSession,
    device: Device,
) -> Tensor:
    dtype = DType.float32
    input_tensor = Tensor.from_numpy(input).to(device)
    mojo_kernels = Path(__file__).parent / "op"

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        # FILL IN (roughly 4 unformatted lines)
        pass

View full file: problems/p16/p16.py

Tips
  1. Use graph.inputs[0] to access the input tensor passed to the graph
  2. Call ops.custom() with the name matching your registered custom op (“softmax”)
  3. Pass the input tensor as a value to the custom operation
  4. Specify the output type to match the input shape
  5. Include the “input_size” parameter which is required by the kernel
  6. Set graph.outputs to a list containing your operation’s output tensor

You can run the puzzle with:

uv run poe p16
pixi run p16

When successful, you should see output similar to the following on CPU and GPU:

Input shape: (128,)
First few random input values: [ 1.1810775   0.60472375  0.5718309   0.6644599  -0.08899796]
Compiling softmax graph on Device(type=cpu,id=0)
Executing softmax on Device(type=cpu,id=0)
====================================================================================================
Compiling softmax graph on Device(type=gpu,id=0)
Executing softmax on Device(type=gpu,id=0)
====================================================================================================
First few softmax results on CPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few softmax results on GPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few expected results (SciPy calculation): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
Verification passed: Custom kernel results match SciPy calculation
Sum of all probabilities on CPU: 1.0
Sum of all probabilities on GPU: 1.0

This indicates that your custom MAX Graph operation correctly implements the softmax algorithm and produces a valid probability distribution.

Solution

To solve this puzzle, we need to implement both the Mojo kernels (GPU and CPU) and the Python graph definition for our softmax custom operation. Similar to what we did in Puzzle 15, we’re creating a bridge between Python’s ecosystem and Mojo’s GPU-accelerated computing capabilities.

The softmax operation we’re implementing is mathematically defined as:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

However, to prevent numerical overflow, we use the more stable form:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

GPU kernel implementation:

fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    shared_max = tb[dtype]().row_major[TPB]().shared().alloc()
    shared_sum = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    var thread_max: Scalar[dtype] = min_finite[dtype]()
    if global_i < input_size:
        thread_max = rebind[Scalar[dtype]](input[global_i])

    shared_max[local_i] = thread_max
    barrier()

    # Parallel reduction to find max
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared_max[local_i] = max(
                shared_max[local_i], shared_max[local_i + stride]
            )
        barrier()
        stride = stride // 2

    block_max = shared_max[0]

    var exp_val: Scalar[dtype] = 0.0
    if global_i < input_size:
        exp_val = rebind[Scalar[dtype]](exp(input[global_i] - block_max))
        out[global_i] = exp_val

    shared_sum[local_i] = exp_val
    barrier()

    # Parallel reduction for sum
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared_sum[local_i] += shared_sum[local_i + stride]
        barrier()
        stride = stride // 2

    block_sum = shared_sum[0]

    # Normalize by sum
    if global_i < input_size:
        out[global_i] = out[global_i] / block_sum


Our GPU kernel implements the numerically stable softmax algorithm using optimized parallel reduction techniques. Let's dissect it in detail:

Kernel signature and memory management

fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
)

The kernel is parameterized with:

  • Common layout parameter for both input and output tensors
  • Vector size as an Integer parameter
  • Configurable data type with float32 as default
  • Mutable output tensor for in-place computation
  • Non-mutable input tensor (mut=False)

Shared memory allocation

shared_max = tb[dtype]().row_major[TPB]().shared().alloc()
shared_sum = tb[dtype]().row_major[TPB]().shared().alloc()

The kernel allocates two shared memory buffers:

  • shared_max: For parallel maximum finding reduction
  • shared_sum: For parallel sum computation
  • Both use TPB (Threads Per Block = 128) as their size
  • Shared memory provides fast access for all threads within a block

Thread indexing

global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x

Each thread computes:

  • global_i: Its global index in the entire computation space
  • local_i: Its local index within the current thread block

This mapping ensures each thread processes exactly one input element.

Maximum-finding phase

var thread_max: Scalar[dtype] = min_finite[dtype]()
if global_i < input_size:
    thread_max = rebind[Scalar[dtype]](input[global_i])

shared_max[local_i] = thread_max
barrier()

This initializes each thread with:

  • The minimum finite value for elements outside the valid range
  • The actual input value for threads that map to valid elements
  • Storage in shared memory for the reduction process
  • A barrier synchronization to ensure all threads complete memory writes

Parallel max reduction

stride = TPB // 2
while stride > 0:
    if local_i < stride:
        shared_max[local_i] = max(shared_max[local_i], shared_max[local_i + stride])
    barrier()
    stride = stride // 2

This implements a parallel tree-reduction pattern:

  1. Start with stride = 64 (half of TPB)
  2. Each active thread compares two values separated by the stride
  3. Store the maximum in the lower index
  4. Synchronize all threads with a barrier
  5. Halve the stride and repeat
  6. After \(\log_2(TPB)\) steps, shared_max[0] contains the global maximum

This logarithmic reduction is significantly faster than a linear scan on large inputs.
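
To see where the \(\log_2(TPB)\) figure comes from, here is a small host-side Python simulation of the same loop (the inner for stands in for work that the GPU performs across threads in parallel):

import numpy as np

TPB = 128
shared_max = np.random.randn(TPB).astype(np.float32)
reference = shared_max.max()

stride = TPB // 2
steps = 0
while stride > 0:
    for local_i in range(stride):  # these iterations run as parallel threads on the GPU
        shared_max[local_i] = max(shared_max[local_i], shared_max[local_i + stride])
    stride //= 2
    steps += 1

assert shared_max[0] == reference
assert steps == 7  # log2(128) = 7 steps instead of 127 sequential comparisons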

Exponentiation with numerical stability

block_max = shared_max[0]

var exp_val: Scalar[dtype] = 0.0
if global_i < input_size:
    exp_val = rebind[Scalar[dtype]](exp(input[global_i] - block_max))
    out[global_i] = exp_val

Each thread:

  1. Reads the global maximum from shared memory
  2. Subtracts it from its input value before taking the exponential
  3. This subtraction is crucial for numerical stability - it prevents overflow
  4. The largest exponent becomes \(e^0 = 1\), and every other value satisfies \(e^{x_i - \max(x)} < 1\)
  5. Stores the intermediate result in the output buffer
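
A quick NumPy experiment shows why the subtraction matters for float32 inputs, whose exp overflows above roughly 88.7:

import numpy as np

x = np.array([80.0, 85.0, 90.0], dtype=np.float32)
print(np.exp(x))            # the largest entry overflows to inf (with a RuntimeWarning)
print(np.exp(x - x.max()))  # [4.54e-05 6.74e-03 1.00e+00] -- finite, and the softmax is unchanged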

Parallel sum reduction

shared_sum[local_i] = exp_val
barrier()

stride = TPB // 2
while stride > 0:
    if local_i < stride:
        shared_sum[local_i] += shared_sum[local_i + stride]
    barrier()
    stride = stride // 2

The second reduction phase:

  1. Stores all exponential values in shared memory
  2. Uses the same tree-based reduction pattern as for max
  3. But performs addition instead of maximum comparison
  4. After \(\log_2(TPB)\) steps, shared_sum[0] contains the total sum of all exponentials

Final normalization

block_sum = shared_sum[0]

if global_i < input_size:
    out[global_i] = out[global_i] / block_sum

Each thread:

  1. Reads the total sum from shared memory
  2. Divides its exponential value by this sum
  3. Writes the normalized probability to the output buffer
  4. This produces a valid probability distribution that sums to 1

Performance characteristics

The implementation has excellent performance characteristics:

  • Complexity: \(O(\log n)\) for both max and sum calculations vs \(O(n)\) in a sequential approach
  • Memory efficiency: Uses only \(2 \times TPB\) elements of shared memory
  • Work efficiency: Each thread performs approximately \(2 \times \log_2(n)\) operations
  • Load balancing: Each thread handles the same amount of work
  • Synchronization: Uses minimal barriers, only where necessary
  • Memory access: Coalesced global memory access pattern for optimal bandwidth

The algorithm is also numerically robust, handling potential overflow/underflow cases by applying the max-subtraction technique that maintains precision across the wide range of values common in neural network activations.

CPU fallback implementation:

fn softmax_cpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[dtype, layout, MutableAnyOrigin],
    input: LayoutTensor[dtype, layout, MutableAnyOrigin],
):
    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        out[i] = exp_val
        sum_exp += exp_val

    for i in range(input_size):
        out[i] = out[i] / sum_exp


Our CPU implementation provides a sequential fallback that follows the same mathematical approach but is optimized for single-threaded execution. Let's analyze each phase:
  1. Maximum Finding:

    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))
    

    We initialize with the minimum finite value and perform a linear scan through the array, keeping track of the maximum value encountered. This has \(O(n)\) complexity but works efficiently on CPU where we don’t have many cores to parallelize across.

  2. Exponential Computation and Summation:

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        out[i] = exp_val
        sum_exp += exp_val
    

    We compute \(e^{x_i - max}\) for each element, store the result in the output buffer, and accumulate the sum \(\sum_{j=1}^{n} e^{x_j - max}\) in a single pass. This approach minimizes memory operations compared to using separate loops.

  3. Normalization:

    for i in range(input_size):
        out[i] = out[i] / sum_exp
    

    Finally, we normalize each element by dividing by the sum, producing a proper probability distribution according to the softmax formula:

    $$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

The CPU implementation uses the same numerical stability technique (subtracting the maximum) but with sequential operations rather than parallel ones. It’s simpler than the GPU version since it doesn’t need to handle shared memory or thread synchronization, but it’s also less efficient for large inputs.

Both implementations are registered with MAX Graph’s custom operation system through the @compiler.register("softmax") decorator, allowing seamless execution on either device type based on availability.

Python integration:

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        input_value = graph.inputs[0]

        # The output shape is the same as the input for softmax
        # Note: the name must match the name used in `@compiler.register("softmax")` in op/softmax.mojo
        output = ops.custom(
            name="softmax",
            values=[input_value],
            out_types=[
                TensorType(
                    dtype=input_value.tensor.dtype,
                    shape=input_value.tensor.shape,
                    device=DeviceRef.from_device(device),
                )
            ],
            parameters={
                "input_size": input_tensor.shape[0],
                "dtype": dtype,
            },
        )[0].tensor
        graph.output(output)

The Python integration creates a seamless bridge between NumPy arrays and our optimized Mojo GPU kernel. The implementation consists of several key components:
  1. Graph Setup and Configuration:

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
    

    This creates a computation graph named “softmax_graph” that:

    • Defines the input tensor type with proper dtype and shape
    • Maps the tensor to the target device (CPU or GPU)
    • Loads our custom Mojo operations from the specified directory
    • The custom_extensions parameter is crucial for linking to our Mojo implementation
  2. Custom Operation Configuration:

    output = ops.custom(
        name="softmax",
        values=[input_value],
        out_types=[
            TensorType(
                dtype=input_value.tensor.dtype,
                shape=input_value.tensor.shape,
                device=DeviceRef.from_device(device),
            )
        ],
        parameters={
            "input_size": input_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor
    

    This sets up our custom operation with:

    • Name matching the @compiler.register("softmax") in our Mojo code
    • Input values passed as a list
    • Output type definition matching the input shape and type
    • Parameters required by our kernel, including the vector size and data type
    • We extract the tensor from the first returned element with [0].tensor
  3. Graph Output Definition:

    graph.output(output)
    

    This registers our operation’s result as the graph’s output.

The main script includes comprehensive testing that:

  • Generates random input data: np.random.randn(INPUT_SIZE).astype(np.float32)
  • Calculates expected results with SciPy: scipy_softmax(input_array)
  • Verifies numerical accuracy: np.testing.assert_allclose(..., rtol=1e-5)
  • Confirms the output is a valid probability distribution: np.sum(result.to_numpy())

This implementation showcases the power of MAX Graph for integrating high-performance Mojo kernels with Python’s scientific computing ecosystem, providing both efficiency and ease of use.

Puzzle 17: Attention Op

Overview

In this puzzle, we’ll implement the attention mechanism as a custom MAX Graph operation. Attention is a fundamental building block of modern neural networks, popularized in particular by transformers, that allows models to focus on relevant parts of the input when making predictions.

Mathematically, the attention function is defined as:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

Where:

  • \(Q\) is the query vector of shape \((d,)\) - represents what we’re looking for
  • \(K\) is the key matrix of shape \((\text{seq_len}, d)\) - represents what’s available to match against
  • \(V\) is the value matrix of shape \((\text{seq_len}, d)\) - represents the information to retrieve
  • The output is a weighted combination vector of shape \((d,)\)

The computation involves three main steps:

  1. Attention Scores: Compute \(Q \cdot K^T\) to measure how well the query matches each key vector
  2. Attention Weights: Apply softmax to convert scores into a probability distribution (weights sum to 1)
  3. Weighted Sum: Combine value vectors using attention weights to produce the final output

Understanding attention: a step-by-step breakdown

Think of attention as a smart lookup mechanism. Given a query (what you’re looking for), attention finds the most relevant information from a collection of key-value pairs:

  1. Step 1 - Similarity Matching: Compare your query \(Q\) against all keys \(K\) to get similarity scores

    • Compute \(Q \cdot K^T\) where each score measures how well \(Q\) matches each key vector
    • Higher scores = better matches
  2. Step 2 - Probability Distribution: Convert raw scores into normalized weights

    • Apply softmax to ensure all weights sum to 1.0
    • This creates a probability distribution over which values to focus on
  3. Step 3 - Weighted Retrieval: Combine values using the attention weights

    • Multiply each value vector by its corresponding weight
    • Sum everything up to get the final output

Real-world analogy: Imagine searching a library. Your query is what you want to find, the book titles are keys, and the book contents are values. Attention computes how relevant each book is to your query, then gives you a summary weighted by relevance.

Visual computation flow

Input:  Q(16,)    K(16,16)    V(16,16)
         ↓           ↓           ↓
Step 1: Q(1,16) @ K^T(16,16) → Scores(1,16)
         ↓
Step 2: softmax(Scores) → Weights(1,16)  [sum = 1.0]
         ↓
Step 3: Weights(1,16) @ V(16,16) → Output(1,16) → reshape → Output(16,)

Key insight: We reshape the query vector \(Q\) from shape \((16,)\) to \((1,16)\) so we can use matrix multiplication instead of manual dot products. This allows us to leverage the highly optimized tiled matmul kernel from Puzzle 14!
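
The whole flow fits in a few lines of NumPy, which is useful both as a mental model and as a reference to verify against (a sketch with illustrative names, not the puzzle's own test code):

import numpy as np

def attention_reference(Q, K, V):
    scores = Q @ K.T                        # Step 1: (d,) @ (d, seq_len) -> (seq_len,) similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # Step 2: numerically stable softmax; weights sum to 1
    return weights @ V                      # Step 3: (seq_len,) @ (seq_len, d) -> (d,) weighted values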

Our GPU implementation reuses and combines optimized kernels from previous puzzles:

🔄 Kernel Reuse Strategy: This puzzle demonstrates how to build complex operations by combining proven, optimized kernels from previous puzzles. Rather than writing everything from scratch, we leverage the matmul_idiomatic_tiled from Puzzle 14 and softmax_kernel from Puzzle 16, showcasing the power of modular GPU kernel design.

Key concepts

  • Vector attention mechanism for sequence processing
  • Kernel reuse: Leveraging proven implementations from Puzzle 14 and Puzzle 16
  • Efficient matrix multiplication using shared memory tiling
  • Memory-optimized tensor reshaping to minimize buffer allocation
  • Integration of multiple optimized kernels into a single operation
  • Custom MAX Graph operation with multi-input support
  • CPU fallback implementation for compatibility

Configuration

  • Sequence length: \(\text{SEQ_LEN} = 16\) - number of key/value vectors in our sequence
  • Model dimension: \(\text{D} = 16\) - dimensionality of each vector (query, keys, values)
  • Threads per block: \(\text{TPB} = 16\) - matches SEQ_LEN for optimal softmax performance
  • Grid dimensions: Computed dynamically to handle different matrix sizes efficiently
  • Shared memory: Utilized in transpose, matmul, and softmax kernels for performance

Layout configuration:

  • Query tensor: Layout.row_major(d)
  • Key tensor: Layout.row_major(seq_len, d)
  • Value tensor: Layout.row_major(seq_len, d)
  • Output tensor: Layout.row_major(d)
  • Custom op parameters: {"seq_len": seq_len, "d": d, "dtype": dtype}

Key aspects of this puzzle include:

  1. Multi-kernel orchestration: Combining transpose, matmul, and softmax operations
  2. Memory optimization: Using reshape operations and buffer reuse to minimize allocations
  3. Numerical stability: Leveraging the proven softmax implementation from Puzzle 16
  4. Performance optimization: Using tiled algorithms from Puzzle 14 for all matrix operations
  5. Multi-input operations: Handling three input tensors (Q, K, V) in a single custom op

Our attention custom operation will:

  • Accept query, key, and value tensors from Python
  • Process them efficiently on GPU using optimized kernels
  • Return the attention-weighted output vector
  • Match the results of NumPy reference implementation

Code to complete

To complete this puzzle, we’ll leverage the tiled matmul kernel from Puzzle 14 and the softmax kernel from Puzzle 16. You only need to implement the transpose kernel in the Mojo file using shared memory.

1. Implement the transpose kernel

fn transpose_kernel[
    layout_in: Layout,  # Layout for input matrix (seq_len, d)
    layout_out: Layout,  # Layout for output matrix (d, seq_len)
    rows: Int,
    cols: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, layout_out, MutableAnyOrigin],
    inp: LayoutTensor[mut=False, dtype, layout_in, MutableAnyOrigin],
):
    # FILL ME IN (roughly 18 lines)
    ...


View full file: problems/p17/op/attention.mojo

Tips

Transpose Kernel Implementation Guide:

  1. Shared Memory Setup: Use tb[dtype]().row_major[TPB, TPB]().shared().alloc() to create a TPB×TPB shared memory tile for efficient data exchange between threads

  2. Thread Indexing: Map threads to matrix elements:

    • local_row = thread_idx.y, local_col = thread_idx.x (position within the block)
    • global_row = block_idx.y * TPB + local_row (position in the full matrix)
  3. Two-Phase Operation:

    • Phase 1: Load data from global memory into shared memory with normal indexing
    • Phase 2: Store data from shared memory to global memory with swapped indexing
  4. Critical Synchronization: Call barrier() between loading and storing to ensure all threads have finished loading before any thread starts storing

  5. Transpose Magic: The transpose happens through swapped indexing: shared_tile[local_col, local_row] instead of shared_tile[local_row, local_col]

  6. Boundary Handling: Check bounds when accessing global memory to avoid out-of-bounds reads/writes for matrices that don’t perfectly divide by TPB

  7. Memory Coalescing: This pattern ensures both reads and writes are coalesced for optimal memory bandwidth utilization
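
Before writing the Mojo kernel, it can help to model the tile-by-tile data movement in NumPy (a host-side sketch of the same two-phase pattern; on the GPU each element copy is done by one thread, with barrier() between the two phases):

import numpy as np

TPB = 4                                     # small tile size, for illustration only
A = np.arange(6 * 8, dtype=np.float32).reshape(6, 8)
out = np.zeros((8, 6), dtype=np.float32)

for by in range(0, A.shape[0], TPB):        # one (by, bx) pair corresponds to one thread block
    for bx in range(0, A.shape[1], TPB):
        tile = A[by:by + TPB, bx:bx + TPB]                            # Phase 1: load a tile ("shared memory")
        out[bx:bx + tile.shape[1], by:by + tile.shape[0]] = tile.T    # Phase 2: store it transposed

assert np.array_equal(out, A.T)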

2. Orchestrate the attention

            var gpu_ctx = rebind[DeviceContext](ctx[])

            # Define layouts for matrix multiplication
            # Q reshaped to (1, d)
            alias layout_q_2d = Layout.row_major(1, d)
            # K^T is (d, seq_len)
            alias layout_k_t = Layout.row_major(d, seq_len)
            # Scores as (1, seq_len)
            alias layout_scores_2d = Layout.row_major(1, seq_len)
            # Weights as (1, seq_len)
            alias layout_weights_2d = Layout.row_major(1, seq_len)
            # Result as (1, d)
            alias layout_result_2d = Layout.row_major(1, d)

            alias scores_blocks_per_grid = (
                (seq_len + TPB - 1) // TPB,
                (1 + TPB - 1) // TPB,
            )
            alias result_blocks_per_grid = (
                (d + TPB - 1) // TPB,
                (1 + TPB - 1) // TPB,
            )
            alias matmul_threads_per_block = (TPB, TPB)
            alias transpose_blocks_per_grid = (
                (seq_len + TPB - 1) // TPB,
                (d + TPB - 1) // TPB,
            )

            # Allocate minimal temporary buffers - reuse same buffer for different shapes
            k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len * d
            )  # K^T as (d, seq_len)
            scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len
            )  # Reused for scores and weights

            k_t = LayoutTensor[mut=True, dtype, layout_k_t, MutableAnyOrigin](
                k_t_buf.unsafe_ptr()
            )

            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            # FILL ME IN 1 line

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            # FILL ME IN 1 function call

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # GPU: Uses matrix multiplication to compute all Q · K[i] scores in parallel
            # Reuse scores_weights_buf as (1, seq_len) for scores
            # FILL ME IN 2 lines

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            # FILL ME IN 1 line

            # Step 5: Apply softmax to get attention weights
            # FILL ME IN 1 function call

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            # FILL ME IN 1 line

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            # FILL ME IN 2 lines

View full file: problems/p17/op/attention.mojo

Test the kernels

uv run poe p17
pixi run p17

When successful, you should see output similar to the following on CPU and GPU:

Input shapes: Q=(16,), K=(16, 16), V=(16, 16)
Sample Q values: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]
Sample K[0] values: [-0.10128311  0.03142473 -0.09080241 -0.14123037  0.14656489]
Sample V[0] values: [ 0.11631638  0.00102331 -0.09815087  0.04621035  0.01990597]

================================================================================
STEP-BY-STEP VECTOR ATTENTION COMPUTATION DEBUG
================================================================================

1. INPUT SHAPES:
   Q shape: (16,) (query vector)
   K shape: (16, 16) (key matrix)
   V shape: (16, 16) (value matrix)
   Q[:5]: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]

2. ATTENTION SCORES (K[i] · Q):
   Scores shape: (16,)
   Scores[:5]: [-0.03479404 -0.01563787  0.04834607  0.06764711  0.04001468]
   Min: -0.061636, Max: 0.067647
   Manual verification:
     Q · K[0] = K[0] · Q = -0.034794 (computed: -0.034794)
     Q · K[1] = K[1] · Q = -0.015638 (computed: -0.015638)
     Q · K[2] = K[2] · Q = 0.048346 (computed: 0.048346)

3. SOFTMAX:
   Max score: 0.067647
   Attention weights shape: (16,)
   Attention weights[:5]: [0.05981331 0.06097015 0.06499878 0.0662655  0.06445949]
   Sum: 1.000000 (should be 1.0)

4. WEIGHTED SUM OF VALUES:
   Output shape: (16,)
   Output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Output norm: 0.092764
   Manual output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Match: True

================================================================================
TESTING INDIVIDUAL OPERATIONS
================================================================================

Test 1: Vector Dot Product
a · b = 3.000000

Test 2: Matrix-Vector Multiplication
M @ v = [ 3.  7. 11.]

Test 3: Softmax
Input: [1. 2. 3. 4.]
Softmax: [0.0320586  0.08714432 0.2368828  0.6439143 ]
Sum: 1.000000

================================================================================
TESTING FULL ATTENTION
================================================================================
Compiling attention graph on Device(type=cpu,id=0)
Executing attention on Device(type=cpu,id=0)
====================================================================================================

CPU attention output[:5]: [-0.00935538 -0.02434331  0.00306551  0.02346884  0.019306  ]
CPU matches NumPy: True
Compiling attention graph on Device(type=gpu,id=0)
Executing attention on Device(type=gpu,id=0)
====================================================================================================

GPU attention output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
Expected output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
GPU matches NumPy: True

================================================================================
FINAL VERIFICATION
================================================================================
✓ CPU implementation PASSED
✓ GPU implementation PASSED

Output vector norms:
  CPU: 0.092764
  GPU: 0.092764
  Expected: 0.092764

This indicates that your custom MAX Graph operation correctly implements the attention algorithm and produces results matching the NumPy reference implementation.

Solution

To solve this puzzle, we need to implement the transpose kernel in Mojo and complete the Python graph definition for our attention custom operation. This puzzle builds upon concepts from previous puzzles, combining tiled matrix multiplication from Puzzle 14 and softmax from Puzzle 16 into a complete attention mechanism.

Reused kernels

Our implementation directly incorporates these proven kernels:

  1. matmul_idiomatic_tiled from Puzzle 14 - Powers both \(Q \times K^T\) and \(\text{weights} \times V\) operations
  2. softmax_kernel from Puzzle 16 - Provides numerically stable attention weight computation

This exemplifies modular GPU architecture: complex neural network operations built by orchestrating proven, optimized components rather than monolithic implementations.

The attention operation follows the canonical mathematical definition:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

Breaking down the math:

  • \(Q \cdot K^T\): Query-key similarity scores of shape: \((1, \text{seq_len})\)
  • \(\text{softmax}(\cdot)\): Normalize scores to probabilities of shape: \((1, \text{seq_len})\)
  • \(\text{weights} \cdot V\): Weighted combination of values of shape: \((1, d)\)

This involves several computational steps that we optimize using GPU kernels from previous puzzles.

1. Transpose kernel implementation:

fn transpose_kernel[
    layout_in: Layout,  # Layout for input matrix (seq_len, d)
    layout_out: Layout,  # Layout for output matrix (d, seq_len)
    rows: Int,
    cols: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, layout_out, MutableAnyOrigin],
    inp: LayoutTensor[mut=False, dtype, layout_in, MutableAnyOrigin],
):
    """Transpose matrix using shared memory tiling for coalesced access."""
    shared_tile = tb[dtype]().row_major[TPB, TPB]().shared().alloc()

    local_row = thread_idx.y
    local_col = thread_idx.x

    global_row = block_idx.y * TPB + local_row
    global_col = block_idx.x * TPB + local_col

    if global_row < rows and global_col < cols:
        shared_tile[local_row, local_col] = inp[global_row, global_col]
    else:
        shared_tile[local_row, local_col] = 0.0

    barrier()

    out_row = block_idx.x * TPB + local_row
    out_col = block_idx.y * TPB + local_col

    # Store data from shared memory to global memory (coalesced write)
    # Note: we transpose the shared memory access pattern
    if out_row < cols and out_col < rows:
        out[out_row, out_col] = shared_tile[local_col, local_row]


The transpose kernel uses shared memory tiling to achieve coalesced memory access patterns. Key implementation details:

Critical transpose pattern

# Load with normal indexing
shared_tile[local_row, local_col] = inp[global_row, global_col]
barrier()
# Store with swapped indexing for transpose
out[out_row, out_col] = shared_tile[local_col, local_row]

The transpose happens through swapped indexing in shared memory access ([local_col, local_row] instead of [local_row, local_col]) and swapped block coordinates for output positioning. This ensures both reads and writes remain coalesced while achieving the transpose operation.

2. GPU kernel orchestration:


            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            q_2d = q_tensor.reshape[layout_q_2d]()

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            gpu_ctx.enqueue_function[
                transpose_kernel[layout_k, layout_k_t, seq_len, d, dtype]
            ](
                k_t,
                k_tensor,
                grid_dim=transpose_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # This computes Q · K^T[i] = Q · K[i] for each column i of K^T (which is row i of K)
            # Reuse scores_weights_buf as (1, seq_len) for scores
            scores_2d = LayoutTensor[
                mut=True, dtype, layout_scores_2d, MutableAnyOrigin
            ](scores_weights_buf.unsafe_ptr())
            gpu_ctx.enqueue_function[
                matmul_idiomatic_tiled[layout_q_2d, 1, seq_len, d, dtype]
            ](
                scores_2d,
                q_2d,
                k_t,
                grid_dim=scores_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            weights = scores_2d.reshape[layout_scores]()

            # Step 5: Apply softmax to get attention weights
            gpu_ctx.enqueue_function[
                softmax_kernel[layout_scores, seq_len, dtype]
            ](
                weights,
                weights,
                grid_dim=(1, 1),
                block_dim=(seq_len, 1),
            )

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            weights_2d = weights.reshape[layout_weights_2d]()

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            result_2d = out_tensor.reshape[layout_result_2d]()
            gpu_ctx.enqueue_function[
                matmul_idiomatic_tiled[layout_weights_2d, 1, d, seq_len, dtype]
            ](
                result_2d,
                weights_2d,
                v_tensor,
                grid_dim=result_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

The GPU orchestration demonstrates sophisticated kernel chaining and zero-copy memory optimization:

Advanced memory optimization strategies

# Zero-copy reshaping - no data movement, just reinterpret tensor shape
q_2d = q_tensor.reshape[layout_q_2d]()
# Aggressive buffer reuse - same memory, different interpretations
weights = scores_2d.reshape[layout_scores]()

The implementation achieves maximum memory efficiency through:

  • Zero-copy reshaping: Reinterpreting tensor shapes without moving data in memory
  • Intelligent buffer reuse: The same scores_weights_buf serves dual purposes as both scores \((1,\text{seq\_len})\) and weights \((\text{seq\_len},)\)
  • Minimal allocations: Only 2 temporary buffers power the entire attention operation
  • Memory coalescing: All operations maintain optimal memory access patterns

Strategic kernel reuse pattern

  • Steps 3 & 7: Both leverage matmul_idiomatic_tiled from Puzzle 14
    • Step 3: \(Q \times K^T\) → attention scores computation \((1,d) \times (d,\text{seq_len}) \rightarrow (1,\text{seq_len})\)
    • Step 7: \(\text{weights} \times V\) → final weighted output \((1,\text{seq_len}) \times (\text{seq_len},d) \rightarrow (1,d)\)
  • Step 5: Employs softmax_kernel from Puzzle 16
    • Converts raw scores into normalized probability distribution
    • Ensures numerical stability through max subtraction and parallel reduction
    • Guarantees \(\sum_{i} \text{weights}[i] = 1.0\)

This exemplifies modular GPU architecture: complex neural network operations built by orchestrating proven, optimized kernels rather than monolithic implementations!

Key implementation insights

Memory optimization strategy

The implementation achieves minimal memory allocation through aggressive buffer reuse:

# Only 2 temporary buffers needed for the entire operation
k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len * d)
scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len)

Key optimization insights:

  • The same scores_weights_buf is reused for both attention scores and weights through reshape operations
  • Zero-copy tensor reshaping eliminates unnecessary data movement

Kernel reuse architecture

This puzzle showcases modular kernel design by combining three specialized kernels:

  • matmul_idiomatic_tiled (used twice) - Powers both \(Q \times K^T\) and \(\text{weights} \times V\) operations
  • softmax_kernel - Provides numerically stable attention weight computation with parallel reduction
  • transpose_kernel - Enables efficient \(K^T\) computation with coalesced memory access

Architectural benefits:

  • Composability: Complex operations built from proven components
  • Maintainability: Each kernel has a single, well-defined responsibility
  • Performance: Leverages highly optimized implementations from previous puzzles
  • Scalability: Modular design enables easy extension to larger attention mechanisms

The implementation demonstrates that sophisticated neural network operations can be built by orchestrating simpler, well-tested GPU kernels rather than writing monolithic implementations.

Bonus challenges

Challenge I: Advanced softmax implementations

This challenge extends Puzzle 16: Softmax Op

Here are some advanced challenges to extend your softmax implementation:

1. Large-scale softmax: Handling TPB < SIZE

When the input size exceeds the number of threads per block (TPB < SIZE), our current implementation fails because a single block cannot process the entire array. Two approaches to solve this:

1.1 Buffer reduction

  • Store block-level results (max and sum) in device memory
  • Use a second kernel to perform reduction across these intermediate results
  • Implement a final normalization pass that uses the global max and sum

1.2 Two-pass softmax

  • First pass: Each block calculates its local max value
  • Synchronize and compute global max
  • Second pass: Calculate \(e^{x-max}\) and local sum
  • Synchronize and compute global sum
  • Final pass: Normalize using global sum

2. Batched softmax

Implement softmax for a batch of vectors (2D input tensor) with these variants:

  • Row-wise softmax: Apply softmax independently to each row
  • Column-wise softmax: Apply softmax independently to each column
  • Compare performance differences between these implementations

Challenge II: Advanced attention mechanisms

This challenge extends Puzzle 17: Attention Op

Building on the vector attention implementation, here are advanced challenges that push the boundaries of attention mechanisms:

1. Larger sequence lengths

Extend the attention mechanism to handle longer sequences using the existing kernels:

1.1 Sequence length scaling

  • Modify the attention implementation to handle SEQ_LEN = 32 and SEQ_LEN = 64
  • Update the TPB (threads per block) parameter accordingly
  • Ensure the transpose kernel handles the larger matrix sizes correctly

1.2 Dynamic sequence lengths

  • Implement attention that can handle variable sequence lengths at runtime
  • Add bounds checking in the kernels to handle sequences shorter than SEQ_LEN
  • Compare performance with fixed vs. dynamic sequence length handling

2. Batched vector attention

Extend to process multiple attention computations simultaneously:

2.1 Batch processing

  • Modify the attention operation to handle multiple query vectors at once
  • Input shapes: Q(batch_size, d), K(seq_len, d), V(seq_len, d)
  • Output shape: (batch_size, d)
  • Reuse the existing kernels with proper indexing

2.2 Memory optimization for batches

  • Minimize memory allocations by reusing buffers across batch elements
  • Compare performance with different batch sizes (2, 4, 8)
  • Analyze memory usage patterns

🚀 Part V: Mojo Functional Patterns - High-Level GPU Programming

Overview

Welcome to Part V: Mojo Functional Patterns! This section introduces you to Mojo’s revolutionary approach to GPU programming through functional patterns that abstract away low-level complexity while delivering exceptional performance. You’ll master the art of writing clean, efficient parallel code that scales across thousands of GPU threads.

What you’ll achieve: Transform from manual GPU kernel programming to high-level functional patterns that automatically handle vectorization, memory optimization, and performance tuning.

Key insight: Modern GPU programming doesn’t require sacrificing elegance for performance - Mojo’s functional patterns give you both.

What you’ll learn

🧠 GPU execution hierarchy

Understand the fundamental relationship between GPU threads and SIMD operations:

GPU Device
├── Grid (your entire problem)
│   ├── Block 1 (group of threads, shared memory)
│   │   ├── Warp 1 (32 threads, lockstep execution) --> We'll learn in Part VI
│   │   │   ├── Thread 1 → SIMD
│   │   │   ├── Thread 2 → SIMD
│   │   │   └── ... (32 threads total)
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (independent group)

What Mojo abstracts for you:

  • Grid/Block configuration automatically calculated
  • Warp management handled transparently
  • Thread scheduling optimized automatically
  • Memory hierarchy optimization built-in

💡 Note: While this Part focuses on functional patterns, warp-level programming and advanced GPU memory management will be covered in detail in Part VI.

Four fundamental patterns

Master the complete spectrum of GPU functional programming:

  1. Elementwise: Maximum parallelism with automatic SIMD vectorization
  2. Tiled: Memory-efficient processing with cache optimization
  3. Manual vectorization: Expert-level control over SIMD operations
  4. Mojo vectorize: Safe, automatic vectorization with bounds checking

🎯 Performance patterns you’ll recognize

Problem: Add two 1024-element vectors (SIZE=1024, SIMD_WIDTH=4)

Elementwise:     256 threads × 1 SIMD op   = High parallelism
Tiled:           32 threads  × 8 SIMD ops  = Cache optimization
Manual:          8 threads   × 32 SIMD ops = Maximum control
Mojo vectorize:  32 threads  × 8 SIMD ops  = Automatic safety

📊 Real performance insights

Learn to interpret empirical benchmark results:

Benchmark Results (SIZE=1,048,576):
elementwise:        11.34ms  ← Maximum parallelism wins at scale
tiled:              12.04ms  ← Good balance of locality and parallelism
manual_vectorized:  15.75ms  ← Complex indexing hurts simple operations
vectorized:         13.38ms  ← Automatic optimization overhead

Prerequisites

Before diving into functional patterns, ensure you’re comfortable with:

  • Basic GPU concepts: Memory hierarchy, thread execution, SIMD operations
  • Mojo fundamentals: Parameter functions, compile-time specialization, capturing semantics
  • LayoutTensor operations: Loading, storing, and tensor manipulation
  • GPU memory management: Buffer allocation, host-device synchronization

Learning path

🔰 1. Elementwise operations

Elementwise - Basic GPU Functional Operations

Start with the foundation: automatic thread management and SIMD vectorization.

What you’ll master:

  • Functional GPU programming with elementwise
  • Automatic SIMD vectorization within GPU threads
  • LayoutTensor operations for safe memory access
  • Capturing semantics in nested functions

Key pattern:

elementwise[add_function, SIMD_WIDTH, target="gpu"](total_size, ctx)

2. Tiled processing

Tile - Memory-Efficient Tiled Processing

Build on elementwise with memory-optimized tiling patterns.

What you’ll master:

  • Tile-based memory organization for cache optimization
  • Sequential SIMD processing within tiles
  • Memory locality principles and cache-friendly access patterns
  • Thread-to-tile mapping vs thread-to-element mapping

Key insight: Tiling trades parallel breadth for memory locality - fewer threads each doing more work with better cache utilization.

🔧 3. Advanced vectorization

Vectorization - Fine-Grained SIMD Control

Explore manual control and automatic vectorization strategies.

What you’ll master:

  • Manual SIMD operations with explicit index management
  • Mojo’s vectorize function for safe, automatic vectorization
  • Chunk-based memory organization for optimal SIMD alignment
  • Performance trade-offs between manual control and safety

Two approaches:

  • Manual: Direct control, maximum performance, complex indexing
  • Mojo vectorize: Automatic optimization, built-in safety, clean code

🧠 4. Threading vs SIMD concepts

GPU Threading vs SIMD - Understanding the Execution Hierarchy

Understand the fundamental relationship between parallelism levels.

What you’ll master:

  • GPU threading hierarchy and hardware mapping
  • SIMD operations within GPU threads
  • Pattern comparison and thread-to-work mapping
  • Choosing the right pattern for different workloads

Key insight: GPU threads provide the parallelism structure, while SIMD operations provide the vectorization within each thread.

📊 5. Performance benchmarking in Mojo

Benchmarking in Mojo

Learn to measure, analyze, and optimize GPU performance scientifically.

What you’ll master:

  • Mojo’s built-in benchmarking framework
  • GPU-specific timing and synchronization challenges
  • Parameterized benchmark functions with compile-time specialization
  • Empirical performance analysis and pattern selection

Critical technique: Using keep() to prevent compiler optimization of benchmarked code.

Getting started

Ready to transform your GPU programming skills? Start with the elementwise pattern and work through each section systematically. Each puzzle builds on the previous concepts while introducing new levels of sophistication.

💡 Success tip: Focus on understanding the why behind each pattern, not just the how. The conceptual framework you develop here will serve you throughout your GPU programming career.

🎯 Learning objective: By the end of Part V, you’ll think in terms of functional patterns rather than low-level GPU mechanics, enabling you to write more maintainable, performant, and portable GPU code.

Ready to begin? Start with Elementwise Operations and discover the power of functional GPU programming!

🔰 Elementwise - Basic GPU Functional Operations

Implement a kernel that adds two vectors element-wise using Mojo’s functional elementwise pattern. Each thread will process multiple SIMD elements automatically, demonstrating how modern GPU programming abstracts away low-level details while maintaining high performance.

Key insight: The elementwise function automatically handles thread management, SIMD vectorization, and memory coalescing for you.

Key concepts

In this puzzle, you’ll master:

  • Functional GPU programming with elementwise
  • Automatic SIMD vectorization within GPU threads
  • LayoutTensor operations for safe memory access
  • GPU thread hierarchy vs SIMD operations
  • Capturing semantics in nested functions

The mathematical operation is simple element-wise addition:

$$\Large \text{out}[i] = a[i] + b[i]$$

But the implementation teaches fundamental patterns for all GPU functional programming in Mojo.

Configuration

  • Vector size: SIZE = 1024
  • Data type: DType.float32
  • SIMD width: Target-dependent (determined by GPU architecture and data type)
  • Layout: Layout.row_major(SIZE) (1D row-major)

Code to complete

alias SIZE = 1024
alias rank = 1
alias layout = Layout.row_major(SIZE)
alias dtype = DType.float32
alias SIMD_WIDTH = simdwidthof[dtype, target = _get_gpu_target()]()


fn elementwise_add[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn add[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        print("idx:", idx)
        # FILL IN (2 to 4 lines)

    elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)


View full file: problems/p20/p20.mojo

Tips

1. Understanding the function structure

The elementwise function expects a nested function with this exact signature:

@parameter
@always_inline
fn your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    # Your implementation here

Why each part matters:

  • @parameter: Enables compile-time specialization for optimal GPU code generation
  • @always_inline: Forces inlining to eliminate function call overhead in GPU kernels
  • capturing: Allows access to variables from the outer scope (the input/output tensors)
  • IndexList[rank]: Provides multi-dimensional indexing (rank=1 for vectors, rank=2 for matrices)

2. Index extraction and SIMD processing

idx = indices[0]  # Extract linear index for 1D operations

This idx represents the starting position for a SIMD vector, not a single element. If SIMD_WIDTH=4 (GPU-dependent), then:

  • Thread 0 processes elements [0, 1, 2, 3] starting at idx=0
  • Thread 1 processes elements [4, 5, 6, 7] starting at idx=4
  • Thread 2 processes elements [8, 9, 10, 11] starting at idx=8
  • And so on…

3. SIMD loading pattern

a_simd = a.load[simd_width](idx, 0)  # Load 4 consecutive floats (GPU-dependent)
b_simd = b.load[simd_width](idx, 0)  # Load 4 consecutive floats (GPU-dependent)

The second parameter 0 is the dimension offset (always 0 for 1D vectors). This loads a vectorized chunk of data in a single operation. The exact number of elements loaded depends on your GPU’s SIMD capabilities.

4. Vector arithmetic

result = a_simd + b_simd  # SIMD addition of 4 elements simultaneously (GPU-dependent)

This performs element-wise addition across the entire SIMD vector in parallel - much faster than 4 separate scalar additions.

5. SIMD storing

out.store[simd_width](idx, 0, result)  # Store 4 results at once (GPU-dependent)

Writes the entire SIMD vector back to memory in one operation.

6. Calling the elementwise function

elementwise[your_function, SIMD_WIDTH, target="gpu"](total_size, ctx)
  • total_size should be a.size() to process all elements
  • The GPU automatically determines how many threads to launch: total_size // SIMD_WIDTH

7. Key debugging insight

Notice the print("idx:", idx) in the template. When you run it, you’ll see:

idx: 0, idx: 4, idx: 8, idx: 12, ...

This shows that each thread handles a different SIMD chunk, automatically spaced by SIMD_WIDTH (which is GPU-dependent).

Running the code

To test your solution, run the following command in your terminal:

uv run poe p20 --elementwise
pixi run p20 --elementwise

Your output will look like this if the puzzle isn’t solved yet:

SIZE: 1024
simd_width: 4
...
idx: 404
idx: 408
idx: 412
idx: 416
...

out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Solution

fn elementwise_add[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn add[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        # Note: this is thread-local SIMD - each thread processes its own vector of data.
        # We'll see this hierarchy in Mojo in more detail later:
        # SIMD within threads, warps across threads, blocks across warps
        a_simd = a.load[simd_width](idx, 0)
        b_simd = b.load[simd_width](idx, 0)
        ret = a_simd + b_simd
        # print(
        #     "idx:", idx, ", a_simd:", a_simd, ", b_simd:", b_simd, " sum:", ret
        # )
        out.store[simd_width](idx, 0, ret)

    elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)


The elementwise functional pattern in Mojo demonstrates several fundamental concepts for modern GPU programming:

1. Functional abstraction philosophy

The elementwise function represents a paradigm shift from traditional GPU programming:

Traditional CUDA/HIP approach:

# Manual thread management
idx = thread_idx.x + block_idx.x * block_dim.x
if idx < size:
    out[idx] = a[idx] + b[idx]  # Scalar operation

Mojo functional approach:

# Automatic management + SIMD vectorization
elementwise[add_function, simd_width, target="gpu"](size, ctx)

What elementwise abstracts away:

  • Thread grid configuration: No need to calculate block/grid dimensions
  • Bounds checking: Automatic handling of array boundaries
  • Memory coalescing: Optimal memory access patterns built-in
  • SIMD orchestration: Vectorization handled transparently
  • GPU target selection: Works across different GPU architectures

2. Deep dive: nested function architecture

@parameter
@always_inline
fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:

Parameter Analysis:

  • @parameter: This decorator enables compile-time specialization. The function is generated separately for each unique simd_width and rank, allowing aggressive optimization.
  • @always_inline: Critical for GPU performance - eliminates function call overhead by embedding the code directly into the kernel.
  • capturing: Enables lexical scoping - the inner function can access variables from the outer scope without explicit parameter passing.
  • IndexList[rank]: Provides dimension-agnostic indexing - the same pattern works for 1D vectors, 2D matrices, 3D tensors, etc.

3. SIMD execution model deep dive

idx = indices[0]                          # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
a_simd = a.load[simd_width](idx, 0)      # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
b_simd = b.load[simd_width](idx, 0)      # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
ret = a_simd + b_simd                    # SIMD: 4 additions in parallel (GPU-dependent)
out.store[simd_width](idx, 0, ret)       # Store: 4 results simultaneously (GPU-dependent)

Execution Hierarchy Visualization:

GPU Architecture:
├── Grid (entire problem)
│   ├── Block 1 (multiple warps)
│   │   ├── Warp 1 (32 threads)  ← we'll learn about warps in Part VI
│   │   │   ├── Thread 1 → SIMD[4 elements]  ← Our focus (GPU-dependent width)
│   │   │   ├── Thread 2 → SIMD[4 elements]
│   │   │   └── ...
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (multiple warps)

For a 1024-element vector with SIMD_WIDTH=4 (example GPU):

  • Total SIMD operations needed: 1024 ÷ 4 = 256
  • GPU launches: 256 threads (1024 ÷ 4)
  • Each thread processes: Exactly 4 consecutive elements
  • Memory bandwidth: SIMD_WIDTH× improvement over scalar operations

Note: SIMD width varies by GPU architecture (e.g., 4 for some GPUs, 8 for RTX 4090, 16 for A100).

4. Memory access pattern analysis

a.load[simd_width](idx, 0)  # Coalesced memory access

Memory Coalescing Benefits:

  • Sequential access: Threads access consecutive memory locations
  • Cache optimization: Maximizes L1/L2 cache hit rates
  • Bandwidth utilization: Achieves near-theoretical memory bandwidth
  • Hardware efficiency: GPU memory controllers optimized for this pattern

Example for SIMD_WIDTH=4 (GPU-dependent):

Thread 0: loads a[0:4]   → Memory bank 0-3
Thread 1: loads a[4:8]   → Memory bank 4-7
Thread 2: loads a[8:12]  → Memory bank 8-11
...
Result: Optimal memory controller utilization

5. Performance characteristics & optimization

Computational Intensity Analysis (for SIMD_WIDTH=4):

  • Arithmetic operations: 1 SIMD addition per 4 elements
  • Memory operations: 2 SIMD loads + 1 SIMD store per 4 elements
  • Arithmetic intensity: 1 add ÷ 3 memory ops = 0.33 (memory-bound); a bytes-based estimate follows below
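
As a rough worked estimate in bytes (assuming float32, so 4 bytes per element): each output element requires two 4-byte loads and one 4-byte store against a single addition, giving

\[\Large \text{arithmetic intensity} \approx \frac{1 \ \text{add}}{12 \ \text{bytes}} \approx 0.08 \ \text{ops/byte}\]

which is far below what any modern GPU needs to be compute-bound.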

Why This Is Memory-Bound:

Memory traffic per element >>> arithmetic per element for simple operations, so memory bandwidth sets the ceiling

Optimization Implications:

  • Focus on memory access patterns rather than arithmetic optimization
  • SIMD vectorization provides the primary performance benefit
  • Memory coalescing is critical for performance
  • Cache locality matters more than computational complexity

6. Scaling and adaptability

Automatic Hardware Adaptation:

alias SIMD_WIDTH = simdwidthof[dtype, target = _get_gpu_target()]()
  • GPU-specific optimization: SIMD width adapts to hardware (e.g., 4 for some cards, 8 for RTX 4090, 16 for A100)
  • Data type awareness: Different SIMD widths for float32 vs float16 (see the sketch after this list)
  • Compile-time optimization: Zero runtime overhead for hardware detection
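
A minimal sketch of this data-type awareness. The import paths are assumptions modeled on problems/p20/p20.mojo, and the printed widths depend entirely on your GPU:

from sys import simdwidthof
from sys.info import _get_gpu_target  # assumed import path


def main():
    # Both aliases are resolved at compile time for the target GPU,
    # so there is no runtime cost for this hardware detection.
    alias f32_width = simdwidthof[DType.float32, target = _get_gpu_target()]()
    alias f16_width = simdwidthof[DType.float16, target = _get_gpu_target()]()
    print("float32 SIMD width:", f32_width)
    print("float16 SIMD width:", f16_width)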

Scalability Properties:

  • Thread count: Automatically scales with problem size
  • Memory usage: Linear scaling with input size
  • Performance: Near-linear speedup until memory bandwidth saturation

7. Advanced insights: why this pattern matters

Foundation for Complex Operations: This elementwise pattern is the building block for:

  • Reduction operations: Sum, max, min across large arrays
  • Broadcast operations: Scalar-to-vector operations
  • Complex transformations: Activation functions, normalization
  • Multi-dimensional operations: Matrix operations, convolutions

Compared to Traditional Approaches:

// Traditional: Error-prone, verbose, hardware-specific
__global__ void add_kernel(float* out, float* a, float* b, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        out[idx] = a[idx] + b[idx];  // No vectorization
    }
}

# Mojo: Safe, concise, automatically vectorized
elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)

Benefits of Functional Approach:

  • Safety: Automatic bounds checking prevents buffer overruns
  • Portability: Same code works across GPU vendors/generations
  • Performance: Compiler optimizations often exceed hand-tuned code
  • Maintainability: Clean abstractions reduce debugging complexity
  • Composability: Easy to combine with other functional operations

This pattern represents the future of GPU programming - high-level abstractions that don’t sacrifice performance, making GPU computing accessible while maintaining optimal efficiency.

Next Steps

Once you’ve mastered elementwise operations, you’re ready for the tiled processing pattern in the next puzzle.

💡 Key Takeaway: The elementwise pattern demonstrates how Mojo combines functional programming elegance with GPU performance, automatically handling vectorization and thread management while maintaining full control over the computation.

⚡ Tile - Memory-Efficient Tiled Processing

Overview

Building on the elementwise pattern, this puzzle introduces tiled processing - a fundamental technique for optimizing memory access patterns and cache utilization on GPUs. Instead of each thread processing individual SIMD vectors across the entire array, tiling organizes data into smaller, manageable chunks that fit better in cache memory.

You’ve already seen tiling in action with Puzzle 14’s tiled matrix multiplication, where we used tiles to process large matrices efficiently. Here, we apply the same tiling principles to vector operations, demonstrating how this technique scales from 2D matrices to 1D arrays.

Implement the same vector addition operation using Mojo’s tiled approach. Each GPU thread will process an entire tile of data sequentially, demonstrating how memory locality can improve performance for certain workloads.

Key insight: Tiling trades parallel breadth for memory locality - fewer threads each doing more work with better cache utilization.

Key concepts

In this puzzle, you’ll master:

  • Tile-based memory organization for cache optimization
  • Sequential SIMD processing within tiles
  • Memory locality principles and cache-friendly access patterns
  • Thread-to-tile mapping vs thread-to-element mapping
  • Performance trade-offs between parallelism and memory efficiency

The same mathematical operation as elementwise: \[\Large \text{out}[i] = a[i] + b[i]\]

But with a completely different execution strategy optimized for memory hierarchy.

Configuration

  • Vector size: SIZE = 1024
  • Tile size: TILE_SIZE = 32
  • Data type: DType.float32
  • SIMD width: GPU-dependent (for operations within tiles)
  • Layout: Layout.row_major(SIZE) (1D row-major)

Code to complete

alias TILE_SIZE = 32


fn tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn process_tiles[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        print("tile_id:", tile_id)
        out_tile = out.tile[tile_size](tile_id)
        a_tile = a.tile[tile_size](tile_id)
        b_tile = b.tile[tile_size](tile_id)

        # FILL IN (6 lines at most)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)


View full file: problems/p20/p20.mojo

Tips

1. Understanding tile organization

The tiled approach divides your data into fixed-size chunks:

num_tiles = (size + tile_size - 1) // tile_size  # Ceiling division

For a 1024-element vector with TILE_SIZE=32: 1024 ÷ 32 = 32 tiles exactly.

2. Tile extraction pattern

Check out the LayoutTensor .tile documentation.

tile_id = indices[0]  # Each thread gets one tile to process
out_tile = out.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)

The tile[size](id) method creates a view of size consecutive elements starting at id × size.
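
For example, a quick sketch using this puzzle’s tile size of 32:

a_tile = a.tile[32](2)  # a zero-copy view over elements [64:96) of the original tensor a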

3. Sequential processing within tiles

Unlike elementwise, you process the tile sequentially:

@parameter
for i in range(tile_size):
    # Process element i within the current tile

This @parameter loop unrolls at compile-time for optimal performance.

4. SIMD operations within tile elements

a_vec = a_tile.load[simd_width](i, 0)  # Load from position i in tile
b_vec = b_tile.load[simd_width](i, 0)  # Load from position i in tile
result = a_vec + b_vec                 # SIMD addition (GPU-dependent width)
out_tile.store[simd_width](i, 0, result)  # Store to position i in tile

5. Thread configuration difference

elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)

Note the 1 instead of SIMD_WIDTH - each thread processes one entire tile sequentially.

6. Memory access pattern insight

Each thread accesses a contiguous block of memory (the tile), then moves to the next tile. This creates excellent spatial locality within each thread’s execution.

7. Key debugging insight

With tiling, you’ll see fewer thread launches but each does more work:

  • Elementwise: ~256 threads (for SIMD_WIDTH=4), each processing 4 elements
  • Tiled: ~32 threads, each processing 32 elements sequentially

Running the code

To test your solution, run the following command in your terminal:

uv run poe p20 --tiled
pixi run p20 --tiled

Your output will look like this when not yet solved:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
...
tile_id: 29
tile_id: 30
tile_id: 31
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Solution

alias TILE_SIZE = 32


fn tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn process_tiles[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]

        out_tile = out.tile[tile_size](tile_id)
        a_tile = a.tile[tile_size](tile_id)
        b_tile = b.tile[tile_size](tile_id)

        @parameter
        for i in range(tile_size):
            a_vec = a_tile.load[simd_width](i, 0)
            b_vec = b_tile.load[simd_width](i, 0)
            ret = a_vec + b_vec
            out_tile.store[simd_width](i, 0, ret)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)


The tiled processing pattern demonstrates advanced memory optimization techniques for GPU programming:

1. Tiling philosophy and memory hierarchy

Tiling represents a fundamental shift in how we think about parallel processing:

Elementwise approach:

  • Wide parallelism: Many threads, each doing minimal work
  • Global memory pressure: Threads scattered across entire array
  • Cache misses: Poor spatial locality across thread boundaries

Tiled approach:

  • Deep parallelism: Fewer threads, each doing substantial work
  • Localized memory access: Each thread works on contiguous data
  • Cache optimization: Excellent spatial and temporal locality

2. Tile organization and indexing

tile_id = indices[0]
out_tile = out.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)

Tile mapping visualization (TILE_SIZE=32):

Original array: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ..., 1023]

Tile 0 (thread 0): [0, 1, 2, ..., 31]      ← Elements 0-31
Tile 1 (thread 1): [32, 33, 34, ..., 63]   ← Elements 32-63
Tile 2 (thread 2): [64, 65, 66, ..., 95]   ← Elements 64-95
...
Tile 31 (thread 31): [992, 993, ..., 1023] ← Elements 992-1023

Key insights:

  • Each tile[size](id) creates a view into the original tensor
  • Views are zero-copy - no data movement, just pointer arithmetic
  • Tile boundaries are always aligned to tile_size boundaries

3. Sequential processing deep dive

@parameter
for i in range(tile_size):
    a_vec = a_tile.load[simd_width](i, 0)
    b_vec = b_tile.load[simd_width](i, 0)
    ret = a_vec + b_vec
    out_tile.store[simd_width](i, 0, ret)

Why sequential processing?

  • Cache optimization: Consecutive memory accesses maximize cache hit rates
  • Compiler optimization: @parameter loops unroll completely at compile-time
  • Memory bandwidth: Sequential access aligns with memory controller design
  • Reduced coordination: No need to synchronize between SIMD groups

Execution pattern within one tile (TILE_SIZE=32, SIMD_WIDTH=4):

Thread processes tile sequentially:
Step 0: Process elements [0:4] with SIMD
Step 1: Process elements [4:8] with SIMD
Step 2: Process elements [8:12] with SIMD
...
Step 7: Process elements [28:32] with SIMD
Total: 8 SIMD operations per thread (32 ÷ 4 = 8)

4. Memory access pattern analysis

Cache behavior comparison:

Elementwise pattern:

Thread 0: accesses global positions [0, 4, 8, 12, ...]    ← Stride = SIMD_WIDTH
Thread 1: accesses global positions [4, 8, 12, 16, ...]   ← Stride = SIMD_WIDTH
...
Result: Memory accesses spread across entire array

Tiled pattern:

Thread 0: accesses positions [0:32] sequentially         ← Contiguous 32-element block
Thread 1: accesses positions [32:64] sequentially       ← Next contiguous 32-element block
...
Result: Perfect spatial locality within each thread

Cache efficiency implications:

  • L1 cache: Small tiles often fit better in L1 cache, reducing cache misses
  • Memory bandwidth: Sequential access maximizes effective bandwidth
  • TLB efficiency: Fewer translation lookaside buffer (TLB) misses
  • Prefetching: Hardware prefetchers work optimally with sequential patterns

5. Thread configuration strategy

elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)

Why 1 instead of SIMD_WIDTH?

  • Thread count: Launch exactly num_tiles threads, not num_tiles × SIMD_WIDTH
  • Work distribution: Each thread handles one complete tile
  • Load balancing: More work per thread, fewer threads total
  • Memory locality: Each thread’s work is spatially localized

Performance trade-offs:

  • Fewer logical threads: May not fully utilize all GPU cores at low occupancy
  • More work per thread: Better cache utilization and reduced coordination overhead
  • Sequential access: Optimal memory bandwidth utilization within each thread
  • Reduced overhead: Less thread launch and coordination overhead

Important note: “Fewer threads” refers to the logical programming model. The GPU scheduler can still achieve high hardware utilization by running multiple warps and efficiently switching between them during memory stalls.

6. Performance characteristics

When tiling helps:

  • Memory-bound operations: When memory bandwidth is the bottleneck
  • Cache-sensitive workloads: Operations that benefit from data reuse
  • Complex operations: When compute per element is higher
  • Limited parallelism: When you have fewer threads than GPU cores

When tiling hurts:

  • Highly parallel workloads: When you need maximum thread utilization
  • Simple operations: When memory access dominates over computation
  • Irregular access patterns: When tiling doesn’t improve locality

For our simple addition example (TILE_SIZE=32):

  • Thread count: 32 threads instead of 256 (8× fewer)
  • Work per thread: 32 elements instead of 4 (8× more)
  • Memory pattern: Sequential vs strided access
  • Cache utilization: Much better spatial locality

7. Advanced tiling considerations

Tile size selection:

  • Too small: Poor cache utilization, more overhead
  • Too large: May not fit in cache, reduced parallelism
  • Sweet spot: Usually 16-64 elements for L1 cache optimization
  • Our choice: 32 elements balances cache usage with parallelism

Hardware considerations:

  • Cache size: Tiles should fit in L1 cache when possible
  • Memory bandwidth: Consider memory controller width
  • Core count: Ensure enough tiles to utilize all cores
  • SIMD width: Tile size should be a multiple of the SIMD width (see the sketch after this list)
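
A minimal sketch of that last point, reusing the aliases and imports from p20.mojo (the factor 8 is an arbitrary illustration, not a recommendation):

alias SIMD_WIDTH = simdwidthof[dtype, target = _get_gpu_target()]()
# Deriving the tile size from the hardware SIMD width guarantees that
# every tile splits evenly into whole SIMD groups.
alias TILE_SIZE = 8 * SIMD_WIDTH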

Comparison summary:

Elementwise: High parallelism, scattered memory access
Tiled:       Moderate parallelism, localized memory access

The choice between elementwise and tiled patterns depends on your specific workload characteristics, data access patterns, and target hardware capabilities.

Next steps

Now that you understand both elementwise and tiled patterns, you’re ready to take finer-grained control in the vectorization puzzle.

💡 Key takeaway: Tiling demonstrates how memory access patterns often matter more than raw computational throughput. The best GPU code balances parallelism with memory hierarchy optimization.

🔧 Vectorization - Fine-Grained SIMD Control

Overview

This puzzle explores advanced vectorization techniques using manual vectorization and vectorize that give you precise control over SIMD operations within GPU kernels. You’ll implement two different approaches to vectorized computation:

  1. Manual vectorization: Direct SIMD control with explicit index calculations
  2. Mojo’s vectorize function: High-level vectorization with automatic bounds checking

Both approaches build on tiling concepts but with different trade-offs between control, safety, and performance optimization.

Key insight: Different vectorization strategies suit different performance requirements and complexity levels.

Key concepts

In this puzzle, you’ll master:

  • Manual SIMD operations with explicit index management
  • Mojo’s vectorize function for safe, automatic vectorization
  • Chunk-based memory organization for optimal SIMD alignment
  • Bounds checking strategies for edge cases
  • Performance trade-offs between manual control and safety

The same mathematical operation as before: \[\Large \text{out}[i] = a[i] + b[i]\]

But with sophisticated vectorization strategies for maximum performance.

Configuration

  • Vector size: SIZE = 1024
  • Tile size: TILE_SIZE = 32
  • Data type: DType.float32
  • SIMD width: GPU-dependent
  • Layout: Layout.row_major(SIZE) (1D row-major)

1. Manual vectorization approach

Code to complete

fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    alias chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        print("tile_id:", tile_id)
        out_tile = out.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)

        # FILL IN (7 lines at most)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


View full file: problems/p20/p20.mojo

Tips

1. Understanding chunk organization

alias chunk_size = tile_size * simd_width  # 32 * 4 = 128 elements per chunk

Each tile now contains multiple SIMD groups, not just sequential elements.

2. Global index calculation

global_start = tile_id * chunk_size + i * simd_width

This calculates the exact global position for each SIMD vector within the chunk.
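
For example, with chunk_size = 128 and simd_width = 4: tile_id = 1 and i = 3 give global_start = 1 × 128 + 3 × 4 = 140.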

3. Direct tensor access

a_vec = a.load[simd_width](global_start, 0)  # Load from global tensor
out.store[simd_width](global_start, 0, ret)  # Store to global tensor

Note: Access the original tensors, not the tile views.

4. Key characteristics

  • More control, more complexity, global tensor access
  • Perfect SIMD alignment with hardware
  • Manual bounds checking required

Running manual vectorization

uv run poe p20 --manual-vectorized
pixi run p20 --manual-vectorized

Your output will look like this when not yet solved:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
tile_id: 4
tile_id: 5
tile_id: 6
tile_id: 7
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Manual vectorization solution

fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    alias chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]

        out_tile = out.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)

        @parameter
        for i in range(tile_size):
            global_start = tile_id * chunk_size + i * simd_width

            a_vec = a.load[simd_width](global_start, 0)
            b_vec = b.load[simd_width](global_start, 0)
            ret = a_vec + b_vec
            # print("tile:", tile_id, "simd_group:", i, "global_start:", global_start, "a_vec:", a_vec, "b_vec:", b_vec, "result:", ret)

            out.store[simd_width](global_start, 0, ret)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


Manual vectorization deep dive

Manual vectorization gives you direct control over SIMD operations with explicit index calculations:

  • Chunk-based organization: chunk_size = tile_size * simd_width
  • Global indexing: Direct calculation of memory positions
  • Manual bounds management: You handle edge cases explicitly

Architecture and memory layout:

alias chunk_size = tile_size * simd_width  # 32 * 4 = 128

Chunk organization visualization (TILE_SIZE=32, SIMD_WIDTH=4):

Original array: [0, 1, 2, 3, ..., 1023]

Chunk 0 (thread 0): [0:128]    ← 128 elements = 32 SIMD groups of 4
Chunk 1 (thread 1): [128:256]  ← Next 128 elements
Chunk 2 (thread 2): [256:384]  ← Next 128 elements
...
Chunk 7 (thread 7): [896:1024] ← Final 128 elements

Processing within one chunk:

@parameter
for i in range(tile_size):  # i = 0, 1, 2, ..., 31
    global_start = tile_id * chunk_size + i * simd_width
    # For tile_id=0: global_start = 0, 4, 8, 12, ..., 124
    # For tile_id=1: global_start = 128, 132, 136, 140, ..., 252

Performance characteristics:

  • Thread count: 8 threads (1024 ÷ 128 = 8)
  • Work per thread: 128 elements (32 SIMD operations of 4 elements each)
  • Memory pattern: Large chunks with perfect SIMD alignment
  • Overhead: Minimal - direct hardware mapping
  • Safety: Manual bounds checking required

Key advantages:

  • Predictable indexing: Exact control over memory access patterns
  • Optimal alignment: SIMD operations perfectly aligned to hardware
  • Maximum throughput: No overhead from safety checks
  • Hardware optimization: Direct mapping to GPU SIMD units

Key challenges:

  • Index complexity: Manual calculation of global positions
  • Bounds responsibility: Must handle edge cases explicitly
  • Debugging difficulty: More complex to verify correctness

2. Mojo vectorize approach

Code to complete

fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start
        print(
            "tile_id:",
            tile_id,
            "tile_start:",
            tile_start,
            "tile_end:",
            tile_end,
            "actual_tile_size:",
            actual_tile_size,
        )

        # FILL IN (9 lines at most)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


View full file: problems/p20/p20.mojo

Tips

1. Tile boundary calculation

tile_start = tile_id * tile_size
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

Handle cases where the last tile might be smaller than tile_size.
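
For example, with a hypothetical size = 100 and tile_size = 32: tile 3 gets tile_start = 96, tile_end = min(128, 100) = 100, so actual_tile_size = 4.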

2. Vectorized function pattern

@parameter
fn vectorized_add[width: Int](i: Int):
    global_idx = tile_start + i
    if global_idx + width <= size:  # Bounds checking
        # SIMD operations here

The width parameter is supplied by vectorize on each call: usually the simd_width you pass in, and a smaller width for any leftover remainder elements.

3. Calling vectorize

vectorize[vectorized_add, simd_width](actual_tile_size)

This automatically handles the vectorization loop with the provided SIMD width.

4. Key characteristics

  • Automatic remainder handling, built-in safety, tile-based access
  • Takes explicit SIMD width parameter
  • Built-in bounds checking and automatic remainder element processing

Running Mojo vectorize

uv run poe p20 --vectorized
pixi run p20 --vectorized

Your output will look like this when not yet solved:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0 tile_start: 0 tile_end: 32 actual_tile_size: 32
tile_id: 1 tile_start: 32 tile_end: 64 actual_tile_size: 32
tile_id: 2 tile_start: 64 tile_end: 96 actual_tile_size: 32
tile_id: 3 tile_start: 96 tile_end: 128 actual_tile_size: 32
...
tile_id: 29 tile_start: 928 tile_end: 960 actual_tile_size: 32
tile_id: 30 tile_start: 960 tile_end: 992 actual_tile_size: 32
tile_id: 31 tile_start: 992 tile_end: 1024 actual_tile_size: 32
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Mojo vectorize solution

fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start

        @parameter
        fn vectorized_add[width: Int](i: Int):
            global_idx = tile_start + i
            if global_idx + width <= size:
                a_vec = a.load[width](global_idx, 0)
                b_vec = b.load[width](global_idx, 0)
                result = a_vec + b_vec
                out.store[width](global_idx, 0, result)

        # Use vectorize within each tile
        vectorize[vectorized_add, simd_width](actual_tile_size)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


Mojo vectorize deep dive

Mojo’s vectorize function provides automatic vectorization with built-in safety:

  • Explicit SIMD width parameter: You provide the simd_width to use
  • Built-in bounds checking: Prevents buffer overruns automatically
  • Automatic remainder handling: Processes leftover elements automatically
  • Nested function pattern: Clean separation of vectorization logic

Tile-based organization:

tile_start = tile_id * tile_size    # 0, 32, 64, 96, ...
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

Automatic vectorization mechanism:

@parameter
fn vectorized_add[width: Int](i: Int):
    global_idx = tile_start + i
    if global_idx + width <= size:
        # Automatic SIMD optimization

How vectorize works:

  • Automatic chunking: Divides actual_tile_size into chunks of your provided simd_width
  • Remainder handling: Automatically processes leftover elements with smaller widths (see the sketch after this list)
  • Bounds safety: Automatically prevents buffer overruns
  • Loop management: Handles the vectorization loop automatically
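
A small host-side sketch of the remainder handling, using a hypothetical size of 10 that does not divide evenly by the SIMD width. The exact sequence of call widths is an assumption about vectorize’s behavior; the point is that the leftover elements are processed for you:

from algorithm import vectorize


def main():
    alias width = 4

    @parameter
    fn show[w: Int](i: Int):
        # Expect calls like (width=4, offset=0), (width=4, offset=4),
        # then smaller-width calls covering the remaining 2 elements.
        print("width:", w, "offset:", i)

    vectorize[show, width](10)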

Execution visualization (TILE_SIZE=32, SIMD_WIDTH=4):

Tile 0 processing:
  vectorize call 0: processes elements [0:4]   with SIMD_WIDTH=4
  vectorize call 1: processes elements [4:8]   with SIMD_WIDTH=4
  ...
  vectorize call 7: processes elements [28:32] with SIMD_WIDTH=4
  Total: 8 automatic SIMD operations

Performance characteristics:

  • Thread count: 32 threads (1024 ÷ 32 = 32)
  • Work per thread: 32 elements (automatic SIMD chunking)
  • Memory pattern: Smaller tiles with automatic vectorization
  • Overhead: Slight - automatic optimization and bounds checking
  • Safety: Built-in bounds checking and edge case handling

Performance comparison and best practices

When to use each approach

Choose manual vectorization when:

  • Maximum performance is critical
  • You have predictable, aligned data patterns
  • Expert-level control over memory access is needed
  • You can guarantee bounds safety manually
  • Hardware-specific optimization is required

Choose Mojo vectorize when:

  • Development speed and safety are priorities
  • Working with irregular or dynamic data sizes
  • You want automatic remainder handling instead of manual edge case management
  • Bounds checking complexity would be error-prone
  • You prefer cleaner vectorization patterns over manual loop management

Advanced optimization insights

Memory bandwidth utilization:

Manual:    8 threads × 32 SIMD ops = 256 total SIMD operations
Vectorize: 32 threads × 8 SIMD ops = 256 total SIMD operations

Both achieve similar total throughput but with different parallelism strategies.

Cache behavior:

  • Manual: Large chunks may exceed L1 cache, but perfect sequential access
  • Vectorize: Smaller tiles fit better in cache, with automatic remainder handling

Hardware mapping:

  • Manual: Direct control over warp utilization and SIMD unit mapping
  • Vectorize: Simplified vectorization with automatic loop and remainder management

Best practices summary

Manual vectorization best practices:

  • Always validate index calculations carefully
  • Use compile-time constants for chunk_size when possible
  • Profile memory access patterns for cache optimization
  • Consider alignment requirements for optimal SIMD performance

Mojo vectorize best practices:

  • Choose appropriate SIMD width for your data and hardware
  • Focus on algorithm clarity over micro-optimizations
  • Use nested parameter functions for clean vectorization logic
  • Trust automatic bounds checking and remainder handling for edge cases

Both approaches represent valid strategies in the GPU performance optimization toolkit, with manual vectorization offering maximum control and Mojo’s vectorize providing safety and automatic remainder handling.

Next steps

Now that you understand all three fundamental patterns, you’re ready to step back and compare GPU threading with SIMD execution.

💡 Key takeaway: Different vectorization strategies suit different performance requirements. Manual vectorization gives maximum control, while Mojo’s vectorize function provides safety and automatic remainder handling. Choose based on your specific performance needs and development constraints.

🧠 GPU Threading vs SIMD - Understanding the Execution Hierarchy

Overview

After exploring elementwise, tiled, and vectorization patterns, you’ve seen different ways to organize GPU computation. This section clarifies the fundamental relationship between GPU threads and SIMD operations - two distinct but complementary levels of parallelism that work together for optimal performance.

Key insight: GPU threads provide the parallelism structure, while SIMD operations provide the vectorization within each thread.

Core concepts

GPU threading hierarchy

GPU execution follows a well-defined hierarchy that abstracts hardware complexity:

GPU Device
├── Grid (your entire problem)
│   ├── Block 1 (group of threads, shared memory)
│   │   ├── Warp 1 (32 threads, lockstep execution)
│   │   │   ├── Thread 1 → SIMD operations
│   │   │   ├── Thread 2 → SIMD operations
│   │   │   └── ... (32 threads total)
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (independent group)

💡 Note: While this Part focuses on functional patterns, warp-level programming and advanced GPU memory management will be covered in detail in Part VI.

What Mojo abstracts for you:

  • Grid/Block configuration: Automatically calculated based on problem size
  • Warp management: Hardware handles 32-thread groups transparently
  • Thread scheduling: GPU scheduler manages execution automatically
  • Memory hierarchy: Optimal access patterns built into functional operations

SIMD within GPU threads

Each GPU thread can process multiple data elements simultaneously using SIMD (Single Instruction, Multiple Data) operations:

// Within one GPU thread:
a_simd = a.load[simd_width](idx, 0)    // Load 4 floats simultaneously
b_simd = b.load[simd_width](idx, 0)    // Load 4 floats simultaneously
result = a_simd + b_simd               // Add 4 pairs simultaneously
out.store[simd_width](idx, 0, result)  // Store 4 results simultaneously

Pattern comparison and thread-to-work mapping

Critical insight: All patterns perform the same total work - 256 SIMD operations for 1024 elements with SIMD_WIDTH=4. The difference is in how this work is distributed across GPU threads.

Thread organization comparison (SIZE=1024, SIMD_WIDTH=4)

| Pattern | Threads | SIMD ops/thread | Memory pattern | Trade-off |
|---|---|---|---|---|
| Elementwise | 256 | 1 | Distributed access | Max parallelism, poor locality |
| Tiled | 32 | 8 | Small blocks | Balanced parallelism + locality |
| Manual vectorized | 8 | 32 | Large chunks | High bandwidth, fewer threads |
| Mojo vectorize | 32 | 8 | Smart blocks | Automatic optimization |

Detailed execution patterns

Elementwise pattern:

Thread 0: [0,1,2,3] → Thread 1: [4,5,6,7] → ... → Thread 255: [1020,1021,1022,1023]
256 threads × 1 SIMD op = 256 total SIMD operations

Tiled pattern:

Thread 0: [0:32] (8 SIMD) → Thread 1: [32:64] (8 SIMD) → ... → Thread 31: [992:1024] (8 SIMD)
32 threads × 8 SIMD ops = 256 total SIMD operations

Manual vectorized pattern:

Thread 0: [0:128] (32 SIMD) → Thread 1: [128:256] (32 SIMD) → ... → Thread 7: [896:1024] (32 SIMD)
8 threads × 32 SIMD ops = 256 total SIMD operations

Mojo vectorize pattern:

Thread 0: [0:32] auto-vectorized → Thread 1: [32:64] auto-vectorized → ... → Thread 31: [992:1024] auto-vectorized
32 threads × 8 SIMD ops = 256 total SIMD operations

Performance characteristics and trade-offs

Core trade-offs summary

| Aspect | High thread count (Elementwise) | Moderate threads (Tiled/Vectorize) | Low threads (Manual) |
|---|---|---|---|
| Parallelism | Maximum latency hiding | Balanced approach | Minimal parallelism |
| Cache locality | Poor between threads | Good within tiles | Excellent sequential |
| Memory bandwidth | Good coalescing | Good + cache reuse | Maximum theoretical |
| Complexity | Simplest | Moderate | Most complex |

When to choose each pattern

Use elementwise when:

  • Simple operations with minimal arithmetic per element
  • Maximum parallelism needed for latency hiding
  • Scalability across different problem sizes is important

Use tiled/vectorize when:

  • Cache-sensitive operations that benefit from data reuse
  • Balanced performance and maintainability desired
  • Automatic optimization (vectorize) is preferred

Use manual vectorization when:

  • Expert-level control over memory patterns is needed
  • Maximum memory bandwidth utilization is critical
  • Development complexity is acceptable

Hardware considerations

Modern GPU architectures include several levels that Mojo abstracts:

Hardware reality:

  • Warps: 32 threads execute in lockstep
  • Streaming Multiprocessors (SMs): Multiple warps execute concurrently
  • SIMD units: Vector processing units within each SM
  • Memory hierarchy: L1/L2 caches, shared memory, global memory

Mojo’s abstraction benefits:

  • Automatically handles warp alignment and scheduling
  • Optimizes memory access patterns transparently
  • Manages resource allocation across SMs
  • Provides portable performance across GPU vendors

Performance mental model

Think of GPU programming as managing two complementary types of parallelism:

Thread-level parallelism:

  • Provides the parallel structure (how many execution units)
  • Enables latency hiding through concurrent execution
  • Managed by GPU scheduler automatically

SIMD-level parallelism:

  • Provides vectorization within each thread
  • Maximizes arithmetic throughput per thread
  • Utilizes vector processing units efficiently

Optimal performance formula:

Performance = (Sufficient threads for latency hiding) ×
              (Efficient SIMD utilization) ×
              (Optimal memory access patterns)

Scaling considerations

| Problem size | Optimal pattern | Reasoning |
|---|---|---|
| Small (< 1K) | Tiled/Vectorize | Lower launch overhead |
| Medium (1K-1M) | Any pattern | Similar performance |
| Large (> 1M) | Usually Elementwise | Parallelism dominates |

The optimal choice depends on your specific hardware, workload complexity, and development constraints.

Next steps

With a solid understanding of GPU threading vs SIMD concepts, you’re ready to measure how these patterns actually perform.

💡 Key takeaway: GPU threads and SIMD operations work together as complementary levels of parallelism. Understanding their relationship allows you to choose the right pattern for your specific performance requirements and constraints.

📊 Benchmarking - Performance Analysis and Optimization

Overview

After mastering elementwise, tiled, manual vectorization, and Mojo vectorize patterns, it’s time to measure their actual performance. This guide explains how to use the built-in benchmarking system in p20.mojo to scientifically compare these approaches and understand their performance characteristics.

Key insight: Theoretical analysis is valuable, but empirical benchmarking reveals the true performance story on your specific hardware.

Running benchmarks

To execute the comprehensive benchmark suite:

uv run poe p20 --benchmark
pixi run p20 --benchmark

Your output will show performance measurements for each pattern:

SIZE: 1024
simd_width: 4
Running P20 GPU Benchmarks...
SIMD width: 4
--------------------------------------------------------------------------------
Testing SIZE=16, TILE=4
Running elementwise_16_4
Running tiled_16_4
Running manual_vectorized_16_4
Running vectorized_16_4
--------------------------------------------------------------------------------
Testing SIZE=128, TILE=16
Running elementwise_128_16
Running tiled_128_16
Running manual_vectorized_128_16
Testing SIZE=128, TILE=16, Vectorize within tiles
Running vectorized_128_16
--------------------------------------------------------------------------------
Testing SIZE=1048576 (1M), TILE=1024
Running elementwise_1M_1024
Running tiled_1M_1024
Running manual_vectorized_1M_1024
Running vectorized_1M_1024
----------------------------------------------------------
| name                      | met (ms)           | iters |
----------------------------------------------------------
| elementwise_16_4          | 4.59953155         | 100   |
| tiled_16_4                | 3.16459014         | 100   |
| manual_vectorized_16_4    | 4.60563415         | 100   |
| vectorized_16_4           | 3.15671539         | 100   |
| elementwise_128_16        | 3.1611135375       | 80    |
| tiled_128_16              | 3.1669656300000004 | 100   |
| manual_vectorized_128_16  | 3.1609855625       | 80    |
| vectorized_128_16         | 3.16142578         | 100   |
| elementwise_1M_1024       | 11.338706742857143 | 70    |
| tiled_1M_1024             | 12.044989871428571 | 70    |
| manual_vectorized_1M_1024 | 15.749412314285713 | 70    |
| vectorized_1M_1024        | 13.377229          | 100   |
----------------------------------------------------------

Benchmarks completed!

Benchmark configuration

The benchmarking system uses Mojo’s built-in benchmark module:

from benchmark import Bench, BenchConfig, Bencher, BenchId, keep
bench_config = BenchConfig(max_iters=10, min_warmuptime_secs=0.2)
  • max_iters=10: Up to 10 iterations for statistical reliability
  • min_warmuptime_secs=0.2: GPU warmup before measurement
  • Check out the benchmark documentation

Benchmarking implementation essentials

Core workflow pattern

Each benchmark follows a streamlined pattern:

@parameter
fn benchmark_pattern_parameterized[test_size: Int, tile_size: Int](mut b: Bencher) raises:
    @parameter
    fn pattern_workflow(ctx: DeviceContext) raises:
        # Setup: Create buffers and initialize data
        # Compute: Execute the algorithm being measured
        # Prevent optimization: keep(out.unsafe_ptr())
        # Synchronize: ctx.synchronize()

    bench_ctx = DeviceContext()
    b.iter_custom[pattern_workflow](bench_ctx)

Key phases:

  1. Setup: Buffer allocation and data initialization
  2. Computation: The actual algorithm being benchmarked
  3. Prevent optimization: Critical for accurate measurement
  4. Synchronization: Ensure GPU work completes
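
A hedged, concrete instance of that workflow for the elementwise pattern, using the benchmark imports shown above. This is a sketch only: the buffer creation, LayoutTensor construction, and elementwise_add signature are assumed to match problems/p20/p20.mojo, not necessarily the exact code in that file.

@parameter
fn benchmark_elementwise_parameterized[
    test_size: Int, tile_size: Int
](mut b: Bencher) raises:
    @parameter
    fn elementwise_workflow(ctx: DeviceContext) raises:
        alias bench_layout = Layout.row_major(test_size)

        # Setup: allocate device buffers and fill them with simple test data
        out = ctx.enqueue_create_buffer[dtype](test_size).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](test_size).enqueue_fill(1)
        b_buf = ctx.enqueue_create_buffer[dtype](test_size).enqueue_fill(2)

        out_tensor = LayoutTensor[mut=True, dtype, bench_layout](out.unsafe_ptr())
        a_tensor = LayoutTensor[mut=False, dtype, bench_layout](a.unsafe_ptr())
        b_tensor = LayoutTensor[mut=False, dtype, bench_layout](b_buf.unsafe_ptr())

        # Compute: the algorithm being measured
        elementwise_add[bench_layout, dtype, SIMD_WIDTH, rank, test_size](
            out_tensor, a_tensor, b_tensor, ctx
        )

        # Prevent optimization, then wait for the asynchronous kernel to finish
        keep(out.unsafe_ptr())
        ctx.synchronize()

    bench_ctx = DeviceContext()
    b.iter_custom[elementwise_workflow](bench_ctx)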

Critical: The keep(out.unsafe_ptr()) call prevents the compiler from optimizing away your computation as “unused code.” Without it, you might measure nothing instead of your algorithm! This is essential for accurate GPU benchmarking because kernels are launched asynchronously.

Why custom iteration works for GPU

Standard benchmarking assumes CPU-style synchronous execution. GPU kernels launch asynchronously, so we need:

  • GPU context management: Proper DeviceContext lifecycle
  • Memory management: Buffer cleanup between iterations
  • Synchronization handling: Accurate timing of async operations
  • Overhead isolation: Separate setup cost from computation cost

Test scenarios and thread analysis

The benchmark suite tests three scenarios to reveal performance characteristics:

Thread utilization summary

| Problem Size | Pattern | Threads | SIMD ops/thread | Total SIMD ops |
|---|---|---|---|---|
| SIZE=16 | Elementwise | 4 | 1 | 4 |
| SIZE=16 | Tiled | 4 | 1 | 4 |
| SIZE=16 | Manual | 1 | 4 | 4 |
| SIZE=16 | Vectorize | 4 | 1 | 4 |
| SIZE=128 | Elementwise | 32 | 1 | 32 |
| SIZE=128 | Tiled | 8 | 4 | 32 |
| SIZE=128 | Manual | 2 | 16 | 32 |
| SIZE=128 | Vectorize | 8 | 4 | 32 |
| SIZE=1M | Elementwise | 262,144 | 1 | 262,144 |
| SIZE=1M | Tiled | 1,024 | 256 | 262,144 |
| SIZE=1M | Manual | 256 | 1,024 | 262,144 |
| SIZE=1M | Vectorize | 1,024 | 256 | 262,144 |

Performance characteristics by problem size

Small problems (SIZE=16):

  • Launch overhead dominates (~3-4ms baseline)
  • Thread count differences don’t matter
  • Tiled/vectorize show lower overhead

Medium problems (SIZE=128):

  • Still overhead-dominated (~3.16ms for all)
  • Performance differences nearly disappear
  • Transitional behavior between overhead and computation

Large problems (SIZE=1M):

  • Real algorithmic differences emerge
  • Memory bandwidth becomes primary factor
  • Clear performance ranking appears

What the data shows

Based on empirical benchmark results across different hardware:

Performance rankings (large problems)

| Rank | Pattern | Typical time | Key insight |
|---|---|---|---|
| 🥇 | Elementwise | ~11.3 ms | Max parallelism wins for memory-bound ops |
| 🥈 | Tiled | ~12.0 ms | Good balance of parallelism + locality |
| 🥉 | Mojo vectorize | ~13.4 ms | Automatic optimization has overhead |
| 4th | Manual vectorized | ~15.7 ms | Complex indexing hurts simple operations |

Key performance insights

For simple memory-bound operations: Maximum parallelism (elementwise) outperforms complex memory optimizations at scale.

Why elementwise wins:

  • 262,144 threads provide excellent latency hiding
  • Simple memory patterns achieve good coalescing
  • Minimal overhead per thread
  • Scales naturally with GPU core count

Why manual vectorization struggles:

  • Only 256 threads limit parallelism
  • Complex indexing adds computational overhead
  • Cache pressure from large chunks per thread
  • Diminishing returns for simple arithmetic

Framework intelligence:

  • Automatic iteration count adjustment (70-100 iterations)
  • Statistical reliability across different execution times
  • Handles thermal throttling and system variation

Interpreting your results

Reading the output table

| name                     | met (ms)           | iters |
| elementwise_1M_1024      | 11.338706742857143 | 70    |
  • met (ms): Total execution time for all iterations
  • iters: Number of iterations performed
  • Compare within problem size: Same-size comparisons are most meaningful

Making optimization decisions

Choose patterns based on empirical evidence:

For production workloads:

  • Large datasets (>100K elements): Elementwise typically optimal
  • Small/startup datasets (<1K elements): Tiled or vectorize for lower overhead
  • Development speed priority: Mojo vectorize for automatic optimization
  • Avoid manual vectorization: Complexity rarely pays off for simple operations

Performance optimization workflow:

  1. Profile first: Measure before optimizing
  2. Test at scale: Small problems mislead about real performance
  3. Consider total cost: Include development and maintenance effort
  4. Validate improvements: Confirm with benchmarks on target hardware

Advanced benchmarking techniques

Custom test scenarios

Modify parameters to test different conditions:

# Different problem sizes
benchmark_elementwise_parameterized[1024, 32]  # Large problem
benchmark_elementwise_parameterized[64, 8]     # Small problem

# Different tile sizes
benchmark_tiled_parameterized[256, 8]   # Small tiles
benchmark_tiled_parameterized[256, 64]  # Large tiles

Hardware considerations

Your results will vary based on:

  • GPU architecture: SIMD width, core count, memory bandwidth
  • System configuration: PCIe bandwidth, CPU performance
  • Thermal state: GPU boost clocks vs sustained performance
  • Concurrent workloads: Other processes affecting GPU utilization

Best practices summary

Benchmarking workflow:

  1. Warm up GPU before critical measurements
  2. Run multiple iterations for statistical significance
  3. Test multiple problem sizes to understand scaling
  4. Use keep() consistently to prevent optimization artifacts
  5. Compare like with like (same problem size, same hardware)

Performance decision framework:

  • Start simple: Begin with elementwise for memory-bound operations
  • Measure don’t guess: Theoretical analysis guides, empirical data decides
  • Scale matters: Small problem performance doesn’t predict large problem behavior
  • Total cost optimization: Balance development time vs runtime performance

Next steps

With benchmarking mastery:

  • Profile real applications: Apply these patterns to actual workloads
  • Advanced GPU patterns: Explore reductions, convolutions, and matrix operations
  • Multi-GPU scaling: Understand distributed GPU computing patterns
  • Memory optimization: Dive deeper into shared memory and advanced caching

💡 Key takeaway: Benchmarking transforms theoretical understanding into practical performance optimization. Use empirical data to make informed decisions about which patterns work best for your specific hardware and workload characteristics.

Looking Ahead: When you need more control

The functional patterns in Part V provide excellent performance for most workloads, but some algorithms require direct thread communication:

Algorithms that benefit from warp programming:

  • Reductions: Sum, max, min operations across thread groups
  • Prefix operations: Cumulative sums, running maximums
  • Data shuffling: Reorganizing data between threads
  • Cooperative algorithms: Where threads must coordinate closely

Performance preview:

In Part VI, we’ll revisit several algorithms from Part II and show how warp operations can:

  • Simplify code: Replace complex shared memory patterns with single function calls
  • Improve performance: Eliminate barriers and reduce memory traffic
  • Enable new algorithms: Unlock patterns impossible with pure functional approaches

Coming up next: Part VI: Warp-Level Programming - starting with a dramatic reimplementation of Puzzle 12’s prefix sum.

⚡ Part VI: GPU Warp Programming - Synchronized Execution Primitives

Overview

Welcome to Part VI: GPU Warp Programming! This section introduces you to GPU warp-level primitives - hardware-accelerated operations that leverage synchronized thread execution within warps. You’ll master the art of using built-in warp operations to replace complex shared memory patterns with simple, efficient function calls.

What you’ll achieve: Transform from complex shared memory + barrier + tree reduction patterns to elegant warp primitive calls that leverage hardware synchronization.

Key insight: GPU warps execute in lockstep - Mojo’s warp operations harness this synchronization to provide powerful parallel primitives with zero explicit synchronization.

What you’ll learn

🧠 GPU warp execution model

Understand the fundamental hardware unit of GPU parallelism:

GPU Block (e.g., 256 threads)
├── Warp 0 (32 threads, SIMT lockstep execution)
│   ├── Lane 0  ─┐
│   ├── Lane 1   │ All execute same instruction
│   ├── Lane 2   │ at same time (SIMT)
│   │   ...      │
│   └── Lane 31 ─┘
├── Warp 1 (32 threads, independent)
├── Warp 2 (32 threads, independent)
└── ...

Hardware reality:

  • 32 threads per warp on NVIDIA GPUs (WARP_SIZE=32)
  • 32 or 64 threads per warp on AMD GPUs (WARP_SIZE=32 or 64)
  • Lockstep execution: All threads in a warp execute the same instruction simultaneously
  • Zero synchronization cost: Warp operations happen instantly within each warp

Warp operations available in Mojo

Master the core warp primitives from gpu.warp:

  1. sum(value): Sum all values across warp lanes
  2. shuffle_idx(value, lane): Get value from specific lane
  3. shuffle_down(value, delta): Get value from lane+delta
  4. prefix_sum(value): Compute prefix sum across lanes
  5. lane_id(): Get current thread’s lane number (0-31 or 0-63)
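
A minimal sketch (not from the puzzle code) of how these primitives are called inside a kernel; my_value is a stand-in for whatever per-lane data the kernel has loaded:

from gpu import lane_id
from gpu.warp import sum, shuffle_idx, shuffle_down, prefix_sum

var lane = lane_id()                        # 0 .. WARP_SIZE-1, free to read
var my_value: Float32 = 1.0                 # stand-in for a per-lane value
var total = sum(my_value)                   # warp-wide sum, same result in every lane
var from_lane0 = shuffle_idx(my_value, 0)   # broadcast lane 0's value to all lanes
var right = shuffle_down(my_value, 1)       # read the value held by lane + 1
var running = prefix_sum(my_value)          # cumulative sum across lanes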

🎯 Performance transformation example

# Complex pattern we have seen earlier (from p10.mojo):
shared = tb[dtype]().row_major[WARP_SIZE]().shared().alloc()
shared[local_i] = partial_product
barrier()
# Tree reduction with barriers...
stride = SIZE // 2
while stride > 0:
    if local_i < stride:
        shared[local_i] += shared[local_i + stride]
    barrier()
    stride //= 2

# Can be replaced with the simple warp approach:
total = sum(partial_product)

📊 When warp operations excel

Learn the performance characteristics:

Problem Scale         Traditional    Warp Operations
Single warp (32)      Fast          Fastest (no barriers)
Few warps (128)       Good          Excellent (minimal overhead)
Many warps (1024+)    Good          Outstanding (scales linearly)
Massive (16K+)        Bottlenecked  Memory-bandwidth limited

Prerequisites

Before diving into warp programming, ensure you’re comfortable with:

  • Part V functional patterns: Elementwise, tiled, and vectorized approaches
  • GPU thread hierarchy: Understanding blocks, warps, and threads
  • LayoutTensor operations: Loading, storing, and tensor manipulation
  • Shared memory concepts: Why barriers and tree reduction are complex

Learning path

🔰 1. SIMT execution model

Warp Lanes & SIMT Execution

Understand the hardware foundation that makes warp operations possible.

What you’ll master:

  • Single Instruction, Multiple Thread (SIMT) execution model
  • Warp divergence and convergence patterns
  • Lane synchronization within warps
  • Hardware vs software thread management

Key insight: Warps are the fundamental unit of GPU execution - understanding SIMT unlocks warp programming.

2. Warp sum fundamentals

warp.sum() Essentials

Master the most important warp operation through dot product implementation.

What you’ll master:

  • Replacing shared memory + barriers with sum()
  • Cross-GPU architecture compatibility (WARP_SIZE)
  • Kernel vs functional programming patterns with warps
  • Performance comparison with traditional approaches

Key pattern:

partial_result = compute_per_lane_value()
total = sum(partial_result)  # Magic happens here!
if lane_id() == 0:
    output[0] = total

📊 3. When to use warp programming

When to Use Warp Programming

Learn the decision framework for choosing warp operations over alternatives.

What you’ll master:

  • Problem characteristics that favor warp operations
  • Performance scaling patterns with warp count
  • Memory bandwidth vs computation trade-offs
  • Warp operation selection guidelines

Decision framework: When reduction operations become the bottleneck, warp primitives often provide the breakthrough.

Key concepts to master

🎯 Hardware-software alignment

Understanding how Mojo’s warp operations map to GPU hardware:

  • SIMT execution: All lanes execute same instruction simultaneously
  • Built-in synchronization: No explicit barriers needed within warps
  • Cross-architecture support: WARP_SIZE handles NVIDIA vs AMD differences

Pattern transformation

Converting complex parallel patterns to warp primitives:

  • Tree reduction → sum()
  • Prefix computation → prefix_sum()
  • Data shuffling → shuffle_idx(), shuffle_down()

📈 Performance characteristics

Recognizing when warp operations provide advantages:

  • Small to medium problems: Eliminates barrier overhead
  • Large problems: Reduces memory traffic and improves cache utilization
  • Regular patterns: Warp operations excel with predictable access patterns

Getting started

Ready to harness GPU warp-level parallelism? Start with understanding the SIMT execution model, then dive into practical warp sum implementation, and finish with the strategic decision framework.

💡 Success tip: Think of warps as synchronized vector units rather than independent threads. This mental model will guide you toward effective warp programming patterns.

🎯 Learning objective: By the end of Part VI, you’ll recognize when warp operations can replace complex synchronization patterns, enabling you to write simpler, faster GPU code.

Ready to begin? Start with SIMT Execution Model and discover the power of warp-level programming!

🧠 Warp Lanes & SIMT Execution

Mental model for warp programming vs SIMD

What is a warp?

A warp is a group of 32 (or 64) GPU threads that execute the same instruction at the same time on different data. Think of it as a synchronized vector unit where each thread acts like a “lane” in a vector processor.

Simple example:

from gpu.warp import sum
# All 32 threads in the warp execute this simultaneously:
var my_value = input[my_thread_id]     # Each gets different data
var warp_total = sum(my_value)    # All contribute to one sum

What just happened? Instead of 32 separate threads doing complex coordination, the warp automatically synchronized them to produce a single result. This is SIMT (Single Instruction, Multiple Thread) execution.

SIMT vs SIMD comparison

If you’re familiar with CPU vector programming (SIMD), GPU warps are similar but with key differences:

| Aspect | CPU SIMD (e.g., AVX) | GPU Warp (SIMT) |
|---|---|---|
| Programming model | Explicit vector operations | Thread-based programming |
| Data width | Fixed (256/512 bits) | Flexible (32/64 threads) |
| Synchronization | Implicit within instruction | Implicit within warp |
| Communication | Via memory/registers | Via shuffle operations |
| Divergence handling | Not applicable | Hardware masking |
| Example | a + b | sum(thread_value) |

CPU SIMD approach (C++ intrinsics):

// Explicit vector operations - say 8 floats in parallel
__m256 result = _mm256_add_ps(a, b);   // Add 8 pairs simultaneously

CPU SIMD approach (Mojo):

# SIMD is a first-class type in Mojo, so if a and b are SIMD vectors of
# 8 floats, this adds all 8 pairs in parallel
var result = a + b  # Add 8 pairs simultaneously

GPU SIMT approach (Mojo):

# Thread-based code that becomes vector operations
from gpu.warp import sum

var my_data = input[thread_id]         # Each thread gets its element
var partial = my_data * coefficient    # All threads compute simultaneously
var total = sum(partial)               # Hardware coordinates the sum

Core concepts that make warps powerful

1. Lane identity: Each thread has a “lane ID” (0 to 31) that’s essentially free to access

var my_lane = lane_id()  # Just reading a hardware register

2. Implicit synchronization: No barriers needed within a warp

# This just works - all threads automatically synchronized
var total = sum(my_contribution)

3. Efficient communication: Threads can share data without memory

# Get value from lane 0 to all other lanes
var broadcasted = shuffle_idx(my_value, 0)

Key insight: SIMT lets you write natural thread code that executes as efficient vector operations, combining the ease of thread programming with the performance of vector processing.

Where warps fit in GPU execution hierarchy

For complete context on how warps relate to the overall GPU execution model, see GPU Threading vs SIMD. Here’s where warps fit:

GPU Device
├── Grid (your entire problem)
│   ├── Block 1 (group of threads, shared memory)
│   │   ├── Warp 1 (32 threads, lockstep execution) ← This level
│   │   │   ├── Thread 1 → SIMD operations
│   │   │   ├── Thread 2 → SIMD operations
│   │   │   └── ... (32 threads total)
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (independent group)

Warp programming operates at the “Warp level” - you work with operations that coordinate all 32 threads within a single warp, enabling powerful primitives like sum() that would otherwise require complex shared memory coordination.

This mental model helps you recognize when problems map naturally to warp operations versus requiring traditional shared memory approaches.

The hardware foundation of warp programming

Understanding Single Instruction, Multiple Thread (SIMT) execution is crucial for effective warp programming. This isn’t just a software abstraction - it’s how GPU hardware actually works at the silicon level.

What is SIMT execution?

SIMT means that within a warp, all threads execute the same instruction at the same time on different data. This is fundamentally different from CPU threads, which can execute completely different instructions independently.

CPU vs GPU Execution Models

| Aspect | CPU (MIMD) | GPU Warp (SIMT) |
|---|---|---|
| Instruction model | Multiple Instructions, Multiple Data | Single Instruction, Multiple Thread |
| Core 1 | add r1, r2 | add r1, r2 |
| Core 2 | load r3, [mem] | add r1, r2 (same instruction) |
| Core 3 | branch loop | add r1, r2 (same instruction) |
| … Core 32 | different instruction | add r1, r2 (same instruction) |
| Execution | Independent, asynchronous | Synchronized, lockstep |
| Scheduling | Complex, OS-managed | Simple, hardware-managed |
| Data | Independent data sets | Different data, same operation |

GPU Warp Execution Pattern:

  • Instruction: Same for all 32 lanes: add r1, r2
  • Lane 0: Operates on Data0Result0
  • Lane 1: Operates on Data1Result1
  • Lane 2: Operates on Data2Result2
  • … (all lanes execute simultaneously)
  • Lane 31: Operates on Data31Result31

Key insight: All lanes execute the same instruction at the same time on different data.

Why SIMT works for GPUs

GPUs are optimized for throughput, not latency. SIMT enables:

  • Hardware simplification: One instruction decoder serves 32 or 64 threads
  • Execution efficiency: No complex scheduling between warp threads
  • Memory bandwidth: Coalesced memory access patterns
  • Power efficiency: Shared control logic across lanes

Warp execution mechanics

Lane numbering and identity

Each thread within a warp has a lane ID from 0 to WARP_SIZE-1:

from gpu import lane_id
from gpu.warp import WARP_SIZE

# Within a kernel function:
my_lane = lane_id()  # Returns 0-31 (NVIDIA/RDNA) or 0-63 (CDNA)

Key insight: lane_id() is free - it’s just reading a hardware register, not computing a value.

Synchronization within warps

The most powerful aspect of SIMT: implicit synchronization.

# Traditional shared memory approach:
shared[local_i] = partial_result
barrier()  # Explicit synchronization required
var total = shared[0] + shared[1] + ...  # Complex reduction

# Warp approach:
from gpu.warp import sum

var total = sum(partial_result)  # Implicit synchronization!

Why no barriers needed? All lanes execute each instruction at exactly the same time. When sum() starts, all lanes have already computed their partial_result.

Warp divergence and convergence

What happens with conditional code?

if lane_id() % 2 == 0:
    # Even lanes execute this path
    result = compute_even()
else:
    # Odd lanes execute this path
    result = compute_odd()
# All lanes converge here

Hardware behavior steps:

| Step | Phase | Active Lanes | Waiting Lanes | Efficiency | Performance Cost |
|---|---|---|---|---|---|
| 1 | Condition evaluation | All 32 lanes | None | 100% | Normal speed |
| 2 | Even lanes branch | Lanes 0,2,4…30 (16 lanes) | Lanes 1,3,5…31 (16 lanes) | 50% | 2× slower |
| 3 | Odd lanes branch | Lanes 1,3,5…31 (16 lanes) | Lanes 0,2,4…30 (16 lanes) | 50% | 2× slower |
| 4 | Convergence | All 32 lanes | None | 100% | Normal speed resumed |

Example breakdown:

  • Step 2: Only even lanes execute compute_even() while odd lanes wait
  • Step 3: Only odd lanes execute compute_odd() while even lanes wait
  • Total time: time(compute_even) + time(compute_odd) (sequential execution)
  • Without divergence: max(time(compute_even), time(compute_odd)) (parallel execution)

Performance impact:

  1. Divergence: Warp splits execution - some lanes active, others wait
  2. Serial execution: Different paths run sequentially, not in parallel
  3. Convergence: All lanes reunite and continue together
  4. Cost: Divergent warps take 2× time (or more) vs unified execution

Best practices for warp efficiency

Warp efficiency patterns

✅ EXCELLENT: Uniform execution (100% efficiency)

# All lanes do the same work - no divergence
var partial = a[global_i] * b[global_i]
var total = sum(partial)

Performance: All 32 lanes active simultaneously

⚠️ ACCEPTABLE: Predictable divergence (~95% efficiency)

# Divergence based on lane_id() - hardware optimized
if lane_id() == 0:
    output[block_idx] = sum(partial)

Performance: Brief single-lane operation, predictable pattern

🔶 CAUTION: Structured divergence (~50-75% efficiency)

# Regular patterns can be optimized by compiler
if (global_i // 4) % 2 == 0:
    result = method_a()
else:
    result = method_b()

Performance: Predictable groups, some optimization possible

❌ AVOID: Data-dependent divergence (~25-50% efficiency)

# Different lanes may take different paths based on data
if input[global_i] > threshold:  # Unpredictable branching
    result = expensive_computation()
else:
    result = simple_computation()

Performance: Random divergence kills warp efficiency

💀 TERRIBLE: Nested data-dependent divergence (~10-25% efficiency)

# Multiple levels of unpredictable branching
if input[global_i] > threshold1:
    if input[global_i] > threshold2:
        result = very_expensive()
    else:
        result = expensive()
else:
    result = simple()

Performance: Warp efficiency destroyed
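
When both branch bodies are cheap, divergence can sometimes be avoided by computing both candidates and selecting between them. The following minimal sketch reuses the input, global_i, and threshold names from the examples above; whether the compiler emits a branch-free select depends on the surrounding code, so treat this as a mitigation to benchmark rather than a guarantee:

var x = input[global_i]
var doubled = x * 2.0   # candidate result from the "true" path
var shifted = x + 1.0   # candidate result from the "false" path
# Every lane computes both cheap candidates, then picks one; with no heavy
# work hidden behind the condition, the warp stays close to uniform execution.
var result = doubled if x > threshold else shifted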

Cross-architecture compatibility

NVIDIA vs AMD warp sizes

from gpu.warp import WARP_SIZE

# NVIDIA GPUs:     WARP_SIZE = 32
# AMD RDNA GPUs:   WARP_SIZE = 32 (wavefront32 mode)
# AMD CDNA GPUs:   WARP_SIZE = 64 (traditional wavefront64)

Why this matters:

  • Memory patterns: Coalesced access depends on warp size
  • Algorithm design: Reduction trees must account for warp size
  • Performance scaling: Twice as many lanes per warp on AMD

Writing portable warp code

Architecture Adaptation Strategies

✅ PORTABLE: Always use WARP_SIZE

alias THREADS_PER_BLOCK = (WARP_SIZE, 1)  # Adapts automatically
alias ELEMENTS_PER_WARP = WARP_SIZE        # Scales with hardware

Result: Code works optimally on NVIDIA/AMD RDNA (WARP_SIZE=32) and AMD CDNA (WARP_SIZE=64)

❌ BROKEN: Never hardcode warp size

alias THREADS_PER_BLOCK = (32, 1)  # Breaks on AMD GPUs!
alias REDUCTION_SIZE = 32           # Wrong on AMD!

Result: Suboptimal on AMD, potential correctness issues

Real Hardware Impact

| GPU Architecture | WARP_SIZE | Memory per Warp | Reduction Steps | Lane Pattern |
|---|---|---|---|---|
| NVIDIA / AMD RDNA | 32 | 128 bytes (4×32) | 5 steps: 32→16→8→4→2→1 | Lanes 0-31 |
| AMD CDNA | 64 | 256 bytes (4×64) | 6 steps: 64→32→16→8→4→2→1 | Lanes 0-63 |

Performance implications of 64 vs 32:

  • CDNA advantage: 2× memory bandwidth per warp
  • CDNA advantage: 2× computation per warp
  • NVIDIA/RDNA advantage: More warps per block (better occupancy)
  • Code portability: Same source, optimal performance on both
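
The reduction-step counts in the table above fall out automatically when the loop is driven by WARP_SIZE instead of a hardcoded 32. Below is a minimal sketch of a manual shuffle-based reduction, shown only to illustrate the portability point; in practice sum() already does this in a single call. partial_value stands in for each lane's Float32 contribution:

from gpu.warp import shuffle_down, WARP_SIZE

var val = partial_value                       # hypothetical per-lane partial result
var offset = WARP_SIZE // 2
while offset > 0:                             # 5 iterations at WARP_SIZE=32, 6 at 64
    val += shuffle_down(val, UInt32(offset))  # add the value held by lane + offset
    offset //= 2
# Lane 0 now holds the warp-wide total.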

Memory access patterns with warps

Coalesced Memory Access Patterns

✅ PERFECT: Coalesced access (100% bandwidth utilization)

# Adjacent lanes → adjacent memory addresses
var value = input[global_i]  # Lane 0→input[0], Lane 1→input[1], etc.

Memory access patterns:

| Access Pattern | NVIDIA/RDNA (32 lanes) | CDNA (64 lanes) | Bandwidth Utilization | Performance |
|---|---|---|---|---|
| ✅ Coalesced | Lane N → Address 4×N | Lane N → Address 4×N | 100% | Optimal |
|  | 1 transaction: 128 bytes | 1 transaction: 256 bytes | Full bus width | Fast |
| ❌ Scattered | Lane N → Random address | Lane N → Random address | ~6% | Terrible |
|  | 32 separate transactions | 64 separate transactions | Mostly idle bus | 32× slower |

Example addresses:

  • Coalesced: Lane 0→0, Lane 1→4, Lane 2→8, Lane 3→12, …
  • Scattered: Lane 0→1000, Lane 1→52, Lane 2→997, Lane 3→8, …

Shared memory bank conflicts

What is a bank conflict?

GPU shared memory is typically divided into 32 independent banks that can be accessed simultaneously. A bank conflict occurs when multiple threads in a warp access different addresses within the same bank at the same time. When this happens, the hardware must serialize these accesses, turning what should be a single-cycle operation into multiple cycles.

Key concepts:

  • No conflict: Each thread accesses a different bank → All accesses happen simultaneously (1 cycle)
  • Bank conflict: Multiple threads access the same bank → Accesses happen sequentially (N cycles for N threads)
  • Broadcast: All threads access the same address → Hardware optimizes this to 1 cycle

Shared memory bank organization:

| Bank | Addresses (byte offsets) | Example Data (float32) |
|---|---|---|
| Bank 0 | 0, 128, 256, 384, … | shared[0], shared[32], shared[64], … |
| Bank 1 | 4, 132, 260, 388, … | shared[1], shared[33], shared[65], … |
| Bank 2 | 8, 136, 264, 392, … | shared[2], shared[34], shared[66], … |
| Bank 31 | 124, 252, 380, 508, … | shared[31], shared[63], shared[95], … |

Bank conflict examples:

| Access Pattern | Bank Usage | Cycles | Performance | Explanation |
|---|---|---|---|---|
| ✅ Sequential | shared[thread_idx.x] | 1 cycle | 100% | Each lane hits a different bank |
|  | Lane 0→Bank 0, Lane 1→Bank 1, … |  | Optimal | No conflicts |
| ❌ Stride 2 | shared[thread_idx.x * 2] | 2 cycles | 50% | 2 lanes per bank |
|  | Lanes 0,16→Bank 0; Lanes 1,17→Bank 1 |  | 2× slower | Serialized access |
| 💀 Stride 32 | shared[thread_idx.x * 32] | 32 cycles | 3% | All 32 lanes hit Bank 0 at different addresses |
|  | Lanes 0,1,…,31→Bank 0 |  | 32× slower | Completely serialized |

(If every lane reads the exact same address, e.g. shared[0], that is a broadcast rather than a conflict: the hardware serves it in a single cycle, as noted above.)
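
The access patterns in the table map directly onto indexing expressions. The kernel-body sketch below is not from the puzzles; it reuses the tb shared-memory builder and thread_idx seen earlier, and sizes the buffer at WARP_SIZE * WARP_SIZE only so the stride-32 indices stay in bounds:

shared = tb[dtype]().row_major[WARP_SIZE * WARP_SIZE]().shared().alloc()
local_i = thread_idx.x

var fast = shared[local_i]          # ✅ sequential: each lane hits a different bank
var slower = shared[local_i * 2]    # ❌ stride 2: two lanes share each bank
var worst = shared[local_i * 32]    # 💀 stride 32: every lane lands in bank 0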

Practical implications for warp programming

When warp operations are most effective

  1. Reduction operations: sum(), max(), etc.
  2. Broadcast operations: shuffle_idx() to share values
  3. Neighbor communication: shuffle_down() for sliding windows (see the sketch after this list)
  4. Prefix computations: prefix_sum() for scan algorithms
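
As a concrete example of the neighbor-communication case, a sliding-window difference needs only a single shuffle instead of a shared-memory staging buffer. A minimal sketch, where my_value is a stand-in for the sample held by each lane:

from gpu.warp import shuffle_down

var right_neighbor = shuffle_down(my_value, 1)   # value held by lane + 1
var forward_diff = right_neighbor - my_value     # per-lane sliding-window difference
# The highest lane has no lane + 1, so its shuffled value is undefined and
# would normally be masked off before being stored.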

Performance characteristics

| Operation Type | Traditional | Warp Operations |
|---|---|---|
| Reduction (32 elements) | ~10 instructions | 1 instruction |
| Memory traffic | High | Minimal |
| Synchronization cost | Expensive | Free |
| Code complexity | High | Low |

Next steps

Now that you understand the SIMT foundation, you’re ready to see how these concepts enable powerful warp operations. The next section will show you how sum() transforms complex reduction patterns into simple, efficient function calls.

→ Continue to warp.sum() Essentials

⚡ warp.sum() Essentials - Warp-Level Dot Product

Implement the dot product we saw in puzzle 10 using Mojo’s warp operations to replace complex shared memory patterns with simple function calls. Each warp lane will process one element and use warp.sum() to combine results automatically, demonstrating how warp programming transforms GPU synchronization.

Key insight: The warp.sum() operation leverages SIMT execution to replace shared memory + barriers + tree reduction with a single hardware-accelerated instruction.

Key concepts

In this puzzle, you’ll master:

  • Warp-level reductions with warp.sum()
  • SIMT execution model and lane synchronization
  • Cross-architecture compatibility with WARP_SIZE
  • Performance transformation from complex to simple patterns
  • Lane ID management and conditional writes

The mathematical operation is a dot product (inner product): \[\Large \text{out}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]

But the implementation teaches fundamental patterns for all warp-level GPU programming in Mojo.

Configuration

  • Vector size: SIZE = WARP_SIZE (32 or 64 depending on GPU architecture)
  • Data type: DType.float32
  • Block configuration: (WARP_SIZE, 1) threads per block
  • Grid configuration: (1, 1) blocks per grid
  • Layout: Layout.row_major(SIZE) (1D row-major)

The traditional complexity (from Puzzle 10)

Recall the complex approach from p10.mojo that required shared memory, barriers, and tree reduction:

alias SIZE = WARP_SIZE
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (WARP_SIZE, 1)  # optimal choice for warp kernel
alias dtype = DType.float32
alias SIMD_WIDTH = simdwidthof[dtype]()
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(1)


fn traditional_dot_product_p10_style[
    in_layout: Layout, out_layout: Layout, size: Int
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
):
    """
    This is the complex approach from p10_layout_tensor.mojo - kept for comparison.
    """
    shared = tb[dtype]().row_major[WARP_SIZE]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = (a[global_i] * b[global_i]).reduce_add()
    else:
        shared[local_i] = 0.0

    barrier()

    stride = SIZE // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    if local_i == 0:
        out[0] = shared[0]


What makes this complex:

  • Shared memory allocation: Manual memory management within blocks
  • Explicit barriers: barrier() calls to synchronize threads
  • Tree reduction: Complex loop with stride-based indexing
  • Conditional writes: Only thread 0 writes the final result

This works, but it’s verbose, error-prone, and requires deep understanding of GPU synchronization.

Test the traditional approach:

uv run poe p21 --traditional
pixi run p21 --traditional

Code to complete

1. Simple warp kernel approach

Transform the complex traditional approach into a simple warp kernel using warp_sum():

from gpu.warp import sum as warp_sum


fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL IN (6 lines at most)


View full file: problems/p21/p21.mojo

Tips

1. Understanding the simple warp kernel structure

You need to complete the simple_warp_dot_product function with 6 lines or fewer:

fn simple_warp_dot_product[...](out, a, b):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL IN (6 lines at most)

Pattern to follow:

  1. Compute partial product for this thread’s element
  2. Use warp_sum() to combine across all warp lanes
  3. Lane 0 writes the final result

2. Computing partial products

var partial_product: Scalar[dtype] = 0
if global_i < size:
    partial_product = (a[global_i] * b[global_i]).reduce_add()

Why .reduce_add()? Values in Mojo are SIMD-based, so a[global_i] * b[global_i] returns a SIMD vector. Use .reduce_add() to sum the vector into a scalar.

Bounds checking: Essential because not all threads may have valid data to process.
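
For intuition about .reduce_add(), it is an ordinary method on SIMD values; a tiny illustration with made-up numbers:

var v = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
var s = v.reduce_add()  # s == 10.0, a single Float32 scalar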

3. Warp reduction magic

total = warp_sum(partial_product)

What warp_sum() does:

  • Takes each lane’s partial_product value
  • Sums them across all lanes in the warp (hardware-accelerated)
  • Returns the same total to all lanes (not just lane 0)
  • Requires zero explicit synchronization (SIMT handles it)

4. Writing the result

if lane_id() == 0:
    out[0] = total

Why only lane 0? All lanes have the same total value after warp_sum(), but we only want to write once to avoid race conditions.

lane_id(): Returns 0-31 (NVIDIA) or 0-63 (AMD) - identifies which lane within the warp.

Test the simple warp kernel:

uv run poe p21 --kernel
pixi run p21 --kernel

Expected output when solved:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
🚀 Notice how simple the warp version is compared to p10.mojo!
   Same kernel structure, but warp_sum() replaces all the complexity!

Solution

fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        out[0] = total


The simple warp kernel demonstrates the fundamental transformation from complex synchronization to hardware-accelerated primitives:

What disappeared from the traditional approach:

  • 15+ lines → 6 lines: Dramatic code reduction
  • Shared memory allocation: Zero memory management required
  • 3+ barrier() calls: Zero explicit synchronization
  • Complex tree reduction: Single function call
  • Stride-based indexing: Eliminated entirely

SIMT execution model:

Warp lanes (SIMT execution):
Lane 0: partial_product = a[0] * b[0]    = 0.0
Lane 1: partial_product = a[1] * b[1]    = 1.0
Lane 2: partial_product = a[2] * b[2]    = 4.0
...
Lane 31: partial_product = a[31] * b[31] = 961.0

warp_sum() hardware operation:
All lanes → 0.0 + 1.0 + 4.0 + ... + 961.0 = 10416.0
All lanes receive → total = 10416.0 (broadcast result)

Why this works without barriers:

  1. SIMT execution: All lanes execute each instruction simultaneously
  2. Hardware synchronization: When warp_sum() begins, all lanes have computed their partial_product
  3. Built-in communication: GPU hardware handles the reduction operation
  4. Broadcast result: All lanes receive the same total value

2. Functional approach

Now implement the same warp dot product using Mojo’s functional programming patterns:

fn functional_warp_dot_product[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    out: LayoutTensor[mut=True, dtype, Layout.row_major(1), MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn compute_dot_product[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        print("idx:", idx)
        # FILL IN (10 lines at most)

    # Launch exactly WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](WARP_SIZE, ctx)


Tips

1. Understanding the functional approach structure

You need to complete the compute_dot_product function with 10 lines or fewer:

@parameter
@always_inline
fn compute_dot_product[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    idx = indices[0]
    # FILL IN (10 lines at most)

Functional pattern differences:

  • Uses elementwise to launch exactly WARP_SIZE threads
  • Each thread processes one element based on idx
  • Same warp operations, different launch mechanism

2. Computing partial products

var partial_product: Scalar[dtype] = 0.0
if idx < size:
    a_val = a.load[1](idx, 0)
    b_val = b.load[1](idx, 0)
    partial_product = (a_val * b_val).reduce_add()
else:
    partial_product = 0.0

Loading pattern: a.load[1](idx, 0) loads exactly 1 element at position idx (not SIMD vectorized).

Bounds handling: Set partial_product = 0.0 for out-of-bounds threads so they don’t contribute to the sum.

3. Warp operations and storing

total = warp_sum(partial_product)

if lane_id() == 0:
    out.store[1](0, 0, total)

Storage pattern: out.store[1](0, 0, total) stores 1 element at position (0, 0) in the output tensor.

Same warp logic: warp_sum() and lane 0 writing work identically in functional approach.

4. Available functions from imports

from gpu import lane_id
from gpu.warp import sum as warp_sum, WARP_SIZE

# Inside your function:
my_lane = lane_id()           # 0 to WARP_SIZE-1
total = warp_sum(my_value)    # Hardware-accelerated reduction
warp_size = WARP_SIZE         # 32 (NVIDIA) or 64 (AMD)

Test the functional approach:

uv run poe p21 --functional
pixi run p21 --functional

Expected output when solved:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
🔧 Functional approach shows modern Mojo style with warp operations!
   Clean, composable, and still leverages warp hardware primitives!

Solution

fn functional_warp_dot_product[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    out: LayoutTensor[mut=True, dtype, Layout.row_major(1), MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn compute_dot_product[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]

        # Each thread computes one partial product
        var partial_product: Scalar[dtype] = 0.0
        if idx < size:
            a_val = a.load[1](idx, 0)
            b_val = b.load[1](idx, 0)
            partial_product = (a_val * b_val).reduce_add()
        else:
            partial_product = 0.0

        # Warp magic - combines all WARP_SIZE partial products!
        total = warp_sum(partial_product)

        # Only lane 0 writes the result (all lanes have the same total)
        if lane_id() == 0:
            out.store[1](0, 0, total)

    # Launch exactly WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](WARP_SIZE, ctx)


The functional warp approach showcases modern Mojo programming patterns with warp operations:

Functional approach characteristics:

elementwise[compute_dot_product, 1, target="gpu"](WARP_SIZE, ctx)

Benefits:

  • Type safety: Compile-time tensor layout checking
  • Composability: Easy integration with other functional operations
  • Modern patterns: Leverages Mojo’s functional programming features
  • Automatic optimization: Compiler can apply high-level optimizations

Key differences from kernel approach:

  • Launch mechanism: Uses elementwise instead of enqueue_function (see the sketch after this list)
  • Memory access: Uses .load[1]() and .store[1]() patterns
  • Integration: Seamlessly works with other functional operations
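
For contrast, the kernel version from earlier in this puzzle is dispatched from the host roughly as sketched below. The alias names come from the p21.mojo listing above, while out_tensor, a_tensor, and b_tensor are placeholders for the device LayoutTensors the host code would have created:

ctx.enqueue_function[
    simple_warp_dot_product[in_layout, out_layout, SIZE]
](
    out_tensor,   # placeholder: device tensor for the result
    a_tensor,     # placeholder: device tensor for input a
    b_tensor,     # placeholder: device tensor for input b
    grid_dim=BLOCKS_PER_GRID,
    block_dim=THREADS_PER_BLOCK,
)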

Same warp benefits:

  • Zero synchronization: warp_sum() works identically
  • Hardware acceleration: Same performance as kernel approach
  • Cross-architecture: WARP_SIZE adapts automatically

Performance comparison with benchmarks

Run comprehensive benchmarks to see how warp operations scale:

uv run poe p21 --benchmark
pixi run p21 --benchmark

Here’s example output from a complete benchmark run:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
--------------------------------------------------------------------------------
Testing SIZE=1 x WARP_SIZE, BLOCKS=1
Running traditional_1x
Running simple_warp_1x
Running functional_warp_1x
--------------------------------------------------------------------------------
Testing SIZE=4 x WARP_SIZE, BLOCKS=4
Running traditional_4x
Running simple_warp_4x
Running functional_warp_4x
--------------------------------------------------------------------------------
Testing SIZE=32 x WARP_SIZE, BLOCKS=32
Running traditional_32x
Running simple_warp_32x
Running functional_warp_32x
--------------------------------------------------------------------------------
Testing SIZE=256 x WARP_SIZE, BLOCKS=256
Running traditional_256x
Running simple_warp_256x
Running functional_warp_256x
--------------------------------------------------------------------------------
Testing SIZE=2048 x WARP_SIZE, BLOCKS=2048
Running traditional_2048x
Running simple_warp_2048x
Running functional_warp_2048x
--------------------------------------------------------------------------------
Testing SIZE=16384 x WARP_SIZE, BLOCKS=16384 (Large Scale)
Running traditional_16384x
Running simple_warp_16384x
Running functional_warp_16384x
--------------------------------------------------------------------------------
Testing SIZE=65536 x WARP_SIZE, BLOCKS=65536 (Massive Scale)
Running traditional_65536x
Running simple_warp_65536x
Running functional_warp_65536x
-------------------------------------------------------
| name                   | met (ms)           | iters |
-------------------------------------------------------
| traditional_1x         | 3.5565388994708993 | 378   |
| simple_warp_1x         | 3.1609036200000005 | 100   |
| functional_warp_1x     | 3.22122741         | 100   |
| traditional_4x         | 3.1741644200000003 | 100   |
| simple_warp_4x         | 4.6268518          | 100   |
| functional_warp_4x     | 3.18364685         | 100   |
| traditional_32x        | 3.19311859         | 100   |
| simple_warp_32x        | 3.18385162         | 100   |
| functional_warp_32x    | 3.18260223         | 100   |
| traditional_256x       | 4.704542839999999  | 100   |
| simple_warp_256x       | 3.599057930294906  | 373   |
| functional_warp_256x   | 3.21388549         | 100   |
| traditional_2048x      | 3.31929595         | 100   |
| simple_warp_2048x      | 4.80178161         | 100   |
| functional_warp_2048x  | 3.734744261111111  | 360   |
| traditional_16384x     | 6.39709167         | 100   |
| simple_warp_16384x     | 7.8748059          | 100   |
| functional_warp_16384x | 7.848806150000001  | 100   |
| traditional_65536x     | 25.155625274509806 | 51    |
| simple_warp_65536x     | 25.10668252830189  | 53    |
| functional_warp_65536x | 25.053512849056602 | 53    |
-------------------------------------------------------

Benchmarks completed!

🚀 WARP OPERATIONS PERFORMANCE ANALYSIS:
   GPU Architecture: NVIDIA (WARP_SIZE=32) vs AMD (WARP_SIZE=64)
   - 1 x WARP_SIZE: Single warp baseline
   - 4 x WARP_SIZE: Few warps, warp overhead visible
   - 32 x WARP_SIZE: Medium scale, warp benefits emerge
   - 256 x WARP_SIZE: Large scale, dramatic warp advantages
   - 2048 x WARP_SIZE: Massive scale, warp operations dominate
   - 16384 x WARP_SIZE: Large scale (512K-1M elements)
   - 65536 x WARP_SIZE: Massive scale (2M-4M elements)
   - Note: AMD GPUs process 2 x elements per warp vs NVIDIA!

   Expected Results at Large Scales:
   • Traditional: Slower due to more barrier overhead
   • Warp operations: Faster, scale better with problem size
   • Memory bandwidth becomes the limiting factor

Performance insights from this example:

  • Small scales (1x-4x): Results are noisy; the warp kernels are ~10% faster at 1x, while at 4x the simple warp kernel is actually slower in this run
  • Medium scale (32x-256x): The functional approach often performs best
  • Large scales (16K-65K): The approaches converge as memory bandwidth dominates (traditional is even slightly ahead at 16K here)
  • Variability: Performance depends heavily on specific GPU architecture and memory subsystem

Note: Your results will vary significantly depending on your hardware (GPU model, memory bandwidth, WARP_SIZE). The key insight is observing the relative performance trends rather than absolute timings.

Next Steps

Once you’ve mastered warp sum operations, you’re ready for:

  • 📊 When to Use Warp Programming: Strategic decision framework for warp vs traditional approaches
  • Advanced warp operations: shuffle_idx(), shuffle_down(), prefix_sum() for complex communication patterns
  • Multi-warp algorithms: Combining warp operations with block-level synchronization
  • Part VII: Memory Coalescing: Optimizing memory access patterns for maximum bandwidth

💡 Key Takeaway: Warp operations transform GPU programming by replacing complex synchronization patterns with hardware-accelerated primitives, demonstrating how understanding the execution model enables dramatic simplification without sacrificing performance.

📊 When to Use Warp Programming

Quick decision guide

✅ Use warp operations when:

  • Reduction operations (sum, max, min) with 32+ elements
  • Regular memory access patterns (adjacent lanes → adjacent addresses)
  • Need cross-architecture portability (NVIDIA/RDNA 32 vs CDNA 64 threads)
  • Want simpler, more maintainable code

❌ Use traditional approaches when:

  • Complex cross-warp synchronization required
  • Irregular/scattered memory access patterns
  • Variable work per thread (causes warp divergence)
  • Problem size < WARP_SIZE

Performance characteristics

Problem size scaling

| Elements | Warp Advantage | Notes |
|---|---|---|
| < 32 | None | Traditional better |
| 32-1K | 1.2-1.5× | Sweet spot begins |
| 1K-32K | 1.5-2.5× | Warp operations excel |
| > 32K | Memory-bound | Both approaches limited by bandwidth |

Key warp advantages

  • No synchronization overhead: Eliminates barrier costs
  • Minimal memory usage: No shared memory allocation needed
  • Better scaling: Performance improves with more warps
  • Simpler code: Fewer lines, less error-prone

Algorithm-specific guidance

| Algorithm | Recommendation | Reason |
|---|---|---|
| Dot product | Warp ops (1K+ elements) | Single reduction, regular access |
| Matrix row/col sum | Warp ops | Natural reduction pattern |
| Prefix sum | Always warp prefix_sum() | Hardware-optimized primitive |
| Pooling (max/min) | Warp ops (regular windows) | Efficient window reductions |
| Histogram | Traditional | Irregular writes, atomic updates |

Code examples

✅ Perfect for warps

# Reduction operations
from gpu.warp import sum, max
var total = sum(partial_values)
var maximum = max(partial_values)

# Communication patterns
from gpu.warp import shuffle_idx, prefix_sum
var broadcast = shuffle_idx(my_value, 0)
var running_sum = prefix_sum(my_value)

❌ Better with traditional approaches

# Complex multi-stage synchronization
stage1_compute()
barrier()  # Need ALL threads to finish
stage2_depends_on_stage1()

# Irregular memory access
var value = input[random_indices[global_i]]  # Scattered reads

# Data-dependent work
if input[global_i] > threshold:
    result = expensive_computation()  # Causes warp divergence

Performance measurement

# Always benchmark both approaches
uv run poe p21 --benchmark    # or: pixi run p21 --benchmark

# Look for scaling patterns:
# traditional_1x:  X.XX ms
# warp_1x:         Y.YY ms  # Should be faster
# warp_32x:        Z.ZZ ms  # Advantage should increase

Summary

Start with warp operations for:

  • Reductions with regular access patterns
  • Problems ≥ 1 warp in size
  • Cross-platform compatibility needs

Use traditional approaches for:

  • Complex synchronization requirements
  • Irregular memory patterns
  • Small problems or heavy divergence

When in doubt: Implement both and benchmark. The performance difference will guide your decision.