🔰 Elementwise - Basic GPU Functional Operations

Implement a kernel that adds two vectors element-wise using Mojo’s functional elementwise pattern. Each thread automatically processes multiple elements as a single SIMD vector, demonstrating how modern GPU programming abstracts away low-level details while maintaining high performance.

Key insight: The elementwise function automatically handles thread management, SIMD vectorization, and memory coalescing for you.

Key concepts

In this puzzle, you’ll master:

  • Functional GPU programming with elementwise
  • Automatic SIMD vectorization within GPU threads
  • LayoutTensor operations for safe memory access
  • GPU thread hierarchy vs SIMD operations
  • Capturing semantics in nested functions

The mathematical operation is simple element-wise addition: \[\Large \text{out}[i] = a[i] + b[i]\]

But the implementation teaches fundamental patterns for all GPU functional programming in Mojo.

Configuration

  • Vector size: SIZE = 1024
  • Data type: DType.float32
  • SIMD width: Target-dependent (determined by GPU architecture and data type)
  • Layout: Layout.row_major(SIZE) (1D row-major)

Code to complete

alias SIZE = 1024
alias rank = 1
alias layout = Layout.row_major(SIZE)
alias dtype = DType.float32
alias SIMD_WIDTH = simdwidthof[dtype, target = _get_gpu_target()]()


fn elementwise_add[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn add[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        print("idx:", idx)
        # FILL IN (2 to 4 lines)

    elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)


View full file: problems/p20/p20.mojo

Tips

1. Understanding the function structure

The elementwise function expects a nested function with this exact signature:

@parameter
@always_inline
fn your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    # Your implementation here

Why each part matters:

  • @parameter: Enables compile-time specialization for optimal GPU code generation
  • @always_inline: Forces inlining to eliminate function call overhead in GPU kernels
  • capturing: Allows access to variables from the outer scope (the input/output tensors)
  • IndexList[rank]: Provides multi-dimensional indexing (rank=1 for vectors, rank=2 for matrices)

2. Index extraction and SIMD processing

idx = indices[0]  # Extract linear index for 1D operations

This idx represents the starting position for a SIMD vector, not a single element. If SIMD_WIDTH=4 (GPU-dependent), then:

  • Thread 0 processes elements [0, 1, 2, 3] starting at idx=0
  • Thread 1 processes elements [4, 5, 6, 7] starting at idx=4
  • Thread 2 processes elements [8, 9, 10, 11] starting at idx=8
  • And so on…

3. SIMD loading pattern

a_simd = a.load[simd_width](idx, 0)  # Load 4 consecutive floats (GPU-dependent)
b_simd = b.load[simd_width](idx, 0)  # Load 4 consecutive floats (GPU-dependent)

The second parameter 0 is the dimension offset (always 0 for 1D vectors). This loads a vectorized chunk of data in a single operation. The exact number of elements loaded depends on your GPU’s SIMD capabilities.
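
For intuition, the same call generalizes to rank-2 tensors. A minimal sketch, assuming the two-argument load above takes (row, col) coordinates for matrices:

row = indices[0]                        # which row this thread works on
col = indices[1]                        # starting column of its SIMD chunk
a_simd = a.load[simd_width](row, col)   # load a chunk along the row (row-major)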

4. Vector arithmetic

result = a_simd + b_simd  # SIMD addition of 4 elements simultaneously (GPU-dependent)

This performs element-wise addition across the entire SIMD vector in parallel - much faster than 4 separate scalar additions.

5. SIMD storing

output.store[simd_width](idx, 0, result)  # Store 4 results at once (GPU-dependent)

Writes the entire SIMD vector back to memory in one operation.

6. Calling the elementwise function

elementwise[your_function, SIMD_WIDTH, target="gpu"](total_size, ctx)
  • total_size should be a.size() to process all elements
  • The GPU automatically determines how many threads to launch: total_size // SIMD_WIDTH

7. Key debugging insight

Notice the print("idx:", idx) in the template. When you run it, you’ll see:

idx: 0, idx: 4, idx: 8, idx: 12, ...

This shows that each thread handles a different SIMD chunk, automatically spaced by SIMD_WIDTH (which is GPU-dependent).

Running the code

To test your solution, run the following command in your terminal:

uv run poe p20 --elementwise
pixi run p20 --elementwise

Your output will look like this if the puzzle isn’t solved yet:

SIZE: 1024
simd_width: 4
...
idx: 404
idx: 408
idx: 412
idx: 416
...

out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Solution

fn elementwise_add[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn add[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        # Note: This is thread-local SIMD - each thread processes its own vector of data.
        # We'll see this hierarchy in Mojo in more detail later:
        # SIMD within threads, warps across threads, blocks across warps
        a_simd = a.load[simd_width](idx, 0)
        b_simd = b.load[simd_width](idx, 0)
        ret = a_simd + b_simd
        # print(
        #     "idx:", idx, ", a_simd:", a_simd, ", b_simd:", b_simd, " sum:", ret
        # )
        output.store[simd_width](idx, 0, ret)

    elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)
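
For context, here is a minimal sketch of a host-side driver for this function. It is illustrative only: the real setup lives in problems/p20/p20.mojo, and the buffer names below (a_buf, b_buf, out_buf) are assumptions, not the actual file contents.

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor


def main():
    with DeviceContext() as ctx:
        # Allocate device buffers for the two inputs and the output
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE)
        b_buf = ctx.enqueue_create_buffer[dtype](SIZE)
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE)

        # Wrap the raw buffers in LayoutTensors for safe, layout-aware access
        a_tensor = LayoutTensor[mut=False, dtype, layout](a_buf.unsafe_ptr())
        b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
        out_tensor = LayoutTensor[mut=True, dtype, layout](out_buf.unsafe_ptr())

        # Launch the functional kernel, then wait for the GPU to finish
        elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE](
            out_tensor, a_tensor, b_tensor, ctx
        )
        ctx.synchronize()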


The elementwise functional pattern in Mojo demonstrates several fundamental concepts for modern GPU programming:

1. Functional abstraction philosophy

The elementwise function represents a paradigm shift from traditional GPU programming:

Traditional CUDA/HIP approach:

// Manual thread management
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
    out[idx] = a[idx] + b[idx];  // Scalar operation
}

Mojo functional approach:

# Automatic management + SIMD vectorization
elementwise[add_function, simd_width, target="gpu"](size, ctx)

What elementwise abstracts away:

  • Thread grid configuration: No need to calculate block/grid dimensions
  • Bounds checking: Automatic handling of array boundaries
  • Memory coalescing: Optimal memory access patterns built-in
  • SIMD orchestration: Vectorization handled transparently
  • GPU target selection: Works across different GPU architectures

2. Deep dive: nested function architecture

@parameter
@always_inline
fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:

Parameter Analysis:

  • @parameter: This decorator enables compile-time specialization. The function is generated separately for each unique simd_width and rank, allowing aggressive optimization.
  • @always_inline: Critical for GPU performance - eliminates function call overhead by embedding the code directly into the kernel.
  • capturing: Enables lexical scoping - the inner function can access variables from the outer scope without explicit parameter passing.
  • IndexList[rank]: Provides dimension-agnostic indexing - the same pattern works for 1D vectors, 2D matrices, 3D tensors, etc.
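
To isolate the capture semantics from the GPU machinery, consider this standalone CPU sketch (our illustration, not from p20.mojo). It assumes captured locals remain mutable by reference, just as the tensor stores above rely on; the nested function reads and updates enclosing-scope state without taking it as an argument:

fn sum_scaled(xs: List[Int], factor: Int) -> Int:
    var total = 0

    @parameter
    fn accumulate(i: Int) capturing -> None:
        # `xs`, `factor`, and `total` are captured from the enclosing scope
        total += xs[i] * factor

    for i in range(len(xs)):
        accumulate(i)
    return total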

3. SIMD execution model deep dive

idx = indices[0]                         # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
a_simd = a.load[simd_width](idx, 0)      # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
b_simd = b.load[simd_width](idx, 0)      # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
ret = a_simd + b_simd                    # SIMD: 4 additions in parallel (GPU-dependent)
output.store[simd_width](idx, 0, ret)    # Store: 4 results simultaneously (GPU-dependent)

Execution Hierarchy Visualization:

GPU Architecture:
├── Grid (entire problem)
│   ├── Block 1 (multiple warps)
│   │   ├── Warp 1 (32 threads) --> we'll learn about warps in Part VI
│   │   │   ├── Thread 1 → SIMD[4 elements]  ← Our focus (GPU-dependent width)
│   │   │   ├── Thread 2 → SIMD[4 elements]
│   │   │   └── ...
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (multiple warps)

For a 1024-element vector with SIMD_WIDTH=4 (example GPU):

  • Total SIMD operations needed: 1024 ÷ 4 = 256
  • GPU launches: 256 threads (1024 ÷ 4)
  • Each thread processes: Exactly 4 consecutive elements
  • Memory bandwidth: SIMD_WIDTH× improvement over scalar operations

Note: SIMD width varies by GPU architecture (e.g., 4 for some GPUs, 8 for RTX 4090, 16 for A100).

4. Memory access pattern analysis

a.load[simd_width](idx, 0)  # Coalesced memory access

Memory Coalescing Benefits:

  • Sequential access: Threads access consecutive memory locations
  • Cache optimization: Maximizes L1/L2 cache hit rates
  • Bandwidth utilization: Achieves near-theoretical memory bandwidth
  • Hardware efficiency: GPU memory controllers optimized for this pattern

Example for SIMD_WIDTH=4 (GPU-dependent):

Thread 0: loads a[0:4]   → Memory bank 0-3
Thread 1: loads a[4:8]   → Memory bank 4-7
Thread 2: loads a[8:12]  → Memory bank 8-11
...
Result: Optimal memory controller utilization

5. Performance characteristics & optimization

Computational Intensity Analysis (for SIMD_WIDTH=4):

  • Arithmetic operations: 1 SIMD addition per 4 elements
  • Memory operations: 2 SIMD loads + 1 SIMD store per 4 elements
  • Arithmetic intensity: 1 add ÷ 3 memory ops = 0.33 (memory-bound)

Why This Is Memory-Bound:

For simple operations like addition, GPU compute throughput far exceeds memory bandwidth, so the loads and stores, not the additions, determine the run time.
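
Recasting the same count in roofline terms (our arithmetic, assuming 4-byte float32 elements) makes the imbalance explicit:

\[\text{arithmetic intensity} = \frac{1 \text{ FLOP}}{(2 \text{ loads} + 1 \text{ store}) \times 4 \text{ bytes}} = \frac{1}{12} \approx 0.083 \ \text{FLOP/byte}\]

Modern GPUs need on the order of tens of FLOPs per byte to become compute-bound, so this kernel sits far down the memory-bound side of the roofline.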

Optimization Implications:

  • Focus on memory access patterns rather than arithmetic optimization
  • SIMD vectorization provides the primary performance benefit
  • Memory coalescing is critical for performance
  • Cache locality matters more than computational complexity

6. Scaling and adaptability

Automatic Hardware Adaptation:

alias SIMD_WIDTH = simdwidthof[dtype, target = _get_gpu_target()]()
  • GPU-specific optimization: SIMD width adapts to hardware (e.g., 4 for some cards, 8 for RTX 4090, 16 for A100)
  • Data type awareness: Different SIMD widths for float32 vs float16
  • Compile-time optimization: Zero runtime overhead for hardware detection
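
A quick way to see this adaptation on your own machine: a small sketch reusing the same helpers as the alias above (simdwidthof and _get_gpu_target, imported as in the full file):

# Compare compiler-chosen widths across data types (illustrative sketch)
alias F32_WIDTH = simdwidthof[DType.float32, target = _get_gpu_target()]()
alias F16_WIDTH = simdwidthof[DType.float16, target = _get_gpu_target()]()


def main():
    print("float32 SIMD width:", F32_WIDTH)  # e.g. 4 on some GPUs
    print("float16 SIMD width:", F16_WIDTH)  # typically double the float32 width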

Scalability Properties:

  • Thread count: Automatically scales with problem size
  • Memory usage: Linear scaling with input size
  • Performance: Near-linear speedup until memory bandwidth saturation

7. Advanced insights: why this pattern matters

Foundation for Complex Operations: This elementwise pattern is the building block for:

  • Reduction operations: Sum, max, min across large arrays
  • Broadcast operations: Scalar-to-vector operations
  • Complex transformations: Activation functions, normalization
  • Multi-dimensional operations: Matrix operations, convolutions
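
To make that concrete, here is a hedged sketch of the same skeleton computing a ReLU activation instead of addition. It assumes the same captured tensors as the solution (a as input, output as destination) and that max broadcasts a scalar zero across the SIMD vector:

@parameter
@always_inline
fn relu[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    idx = indices[0]
    x = a.load[simd_width](idx, 0)               # same vectorized load as before
    output.store[simd_width](idx, 0, max(x, 0))  # elementwise max with zero

# Launched exactly like `add`:
# elementwise[relu, SIMD_WIDTH, target="gpu"](a.size(), ctx)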

Compared to Traditional Approaches:

// Traditional: Error-prone, verbose, hardware-specific
__global__ void add_kernel(float* out, float* a, float* b, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        out[idx] = a[idx] + b[idx];  // No vectorization
    }
}

# Mojo: Safe, concise, automatically vectorized
elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)

Benefits of Functional Approach:

  • Safety: Automatic bounds checking prevents buffer overruns
  • Portability: Same code works across GPU vendors/generations
  • Performance: Compiler optimizations often exceed hand-tuned code
  • Maintainability: Clean abstractions reduce debugging complexity
  • Composability: Easy to combine with other functional operations

This pattern represents the future of GPU programming - high-level abstractions that don’t sacrifice performance, making GPU computing accessible while maintaining optimal efficiency.

Next Steps

Once you’ve mastered elementwise operations, you’re ready for the more advanced functional patterns that build on this foundation.

💡 Key Takeaway: The elementwise pattern demonstrates how Mojo combines functional programming elegance with GPU performance, automatically handling vectorization and thread management while maintaining full control over the computation.