🔰 Elementwise - Basic GPU Functional Operations
Implement a kernel that adds two vectors element-wise using Mojo’s functional elementwise
pattern. Each thread will process multiple SIMD elements automatically, demonstrating how modern GPU programming abstracts away low-level details while maintaining high performance.
Key insight: The elementwise function automatically handles thread management, SIMD vectorization, and memory coalescing for you.
Key concepts
In this puzzle, you’ll master:
- Functional GPU programming with elementwise
- Automatic SIMD vectorization within GPU threads
- LayoutTensor operations for safe memory access
- GPU thread hierarchy vs SIMD operations
- Capturing semantics in nested functions
The mathematical operation is simple element-wise addition: \[\Large \text{out}[i] = a[i] + b[i]\]
But the implementation teaches fundamental patterns for all GPU functional programming in Mojo.
Configuration
- Vector size: SIZE = 1024
- Data type: DType.float32
- SIMD width: Target-dependent (determined by GPU architecture and data type)
- Layout: Layout.row_major(SIZE) (1D row-major)
Code to complete
alias SIZE = 1024
alias rank = 1
alias layout = Layout.row_major(SIZE)
alias dtype = DType.float32
alias SIMD_WIDTH = simdwidthof[dtype, target = _get_gpu_target()]()
fn elementwise_add[
layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@always_inline
fn add[
simd_width: Int, rank: Int
](indices: IndexList[rank]) capturing -> None:
idx = indices[0]
print("idx:", idx)
# FILL IN (2 to 4 lines)
elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)
View full file: problems/p20/p20.mojo
Tips
1. Understanding the function structure
The elementwise function expects a nested function with this exact signature:
@parameter
@always_inline
fn your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
# Your implementation here
Why each part matters:
- @parameter: Enables compile-time specialization for optimal GPU code generation
- @always_inline: Forces inlining to eliminate function call overhead in GPU kernels
- capturing: Allows access to variables from the outer scope (the input/output tensors)
- IndexList[rank]: Provides multi-dimensional indexing (rank=1 for vectors, rank=2 for matrices)
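To see the capturing piece in isolation, here is a minimal sketch modeled on the signature above. The names (outer_example, shifted, offset) and the IndexList import path are illustrative assumptions, not code from p20.mojo:
from utils import IndexList  # assumed import path for IndexList

fn outer_example():
    var offset = 10  # lives in the enclosing scope

    @parameter
    @always_inline
    fn shifted[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        # `offset` is read straight from outer_example's scope - no extra parameter needed
        print(indices[0] + offset)
In the puzzle, the captured variables are the out, a, and b tensors, which is why the nested function can touch them without receiving them as arguments.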
2. Index extraction and SIMD processing
idx = indices[0] # Extract linear index for 1D operations
This idx represents the starting position for a SIMD vector, not a single element. If SIMD_WIDTH=4 (GPU-dependent), then:
- Thread 0 processes elements [0, 1, 2, 3] starting at idx=0
- Thread 1 processes elements [4, 5, 6, 7] starting at idx=4
- Thread 2 processes elements [8, 9, 10, 11] starting at idx=8
- And so on…
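If you want to convince yourself of this spacing before writing any GPU code, a tiny host-side loop reproduces the mapping (the width is hard-coded to 4 here purely for illustration; the real value comes from SIMD_WIDTH and is GPU-dependent):
fn main():
    alias width = 4  # illustrative stand-in for the GPU-dependent SIMD_WIDTH
    for thread in range(4):
        var start = thread * width
        print("thread", thread, "covers elements", start, "through", start + width - 1)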
3. SIMD loading pattern
a_simd = a.load[simd_width](idx, 0) # Load 4 consecutive floats (GPU-dependent)
b_simd = b.load[simd_width](idx, 0) # Load 4 consecutive floats (GPU-dependent)
The second parameter 0 is the dimension offset (always 0 for 1D vectors). This loads a vectorized chunk of data in a single operation. The exact number of elements loaded depends on your GPU’s SIMD capabilities.
4. Vector arithmetic
result = a_simd + b_simd # SIMD addition of 4 elements simultaneously (GPU-dependent)
This performs element-wise addition across the entire SIMD vector in parallel - much faster than 4 separate scalar additions.
5. SIMD storing
out.store[simd_width](idx, 0, result) # Store 4 results at once (GPU-dependent)
Writes the entire SIMD vector back to memory in one operation.
6. Calling the elementwise function
elementwise[your_function, SIMD_WIDTH, target="gpu"](total_size, ctx)
- total_size should be a.size() to process all elements
- The GPU automatically determines how many threads to launch: total_size // SIMD_WIDTH
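With this puzzle’s configuration and the illustrative SIMD width of 4 used in these tips, that works out to:
\[\Large \text{threads launched} = \frac{\text{total\_size}}{\text{SIMD\_WIDTH}} = \frac{1024}{4} = 256\]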
7. Key debugging insight
Notice the print("idx:", idx) in the template. When you run it, you’ll see:
idx: 0, idx: 4, idx: 8, idx: 12, ...
This shows that each thread handles a different SIMD chunk, automatically spaced by SIMD_WIDTH (which is GPU-dependent).
Running the code
To test your solution, run one of the following commands in your terminal:
uv run poe p20 --elementwise
pixi run p20 --elementwise
Your output will look like this if the puzzle isn’t solved yet:
SIZE: 1024
simd_width: 4
...
idx: 404
idx: 408
idx: 412
idx: 416
...
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])
Solution
fn elementwise_add[
layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@always_inline
fn add[
simd_width: Int, rank: Int
](indices: IndexList[rank]) capturing -> None:
idx = indices[0]
# Note: This is thread-local SIMD - each thread processes its own vector of data
# we'll later better see this hierarchy in Mojo:
# SIMD within threads, warp across threads, block across warps
a_simd = a.load[simd_width](idx, 0)
b_simd = b.load[simd_width](idx, 0)
ret = a_simd + b_simd
# print(
# "idx:", idx, ", a_simd:", a_simd, ", b_simd:", b_simd, " sum:", ret
# )
output.store[simd_width](idx, 0, ret)
elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)
The elementwise functional pattern in Mojo demonstrates several fundamental concepts for modern GPU programming:
1. Functional abstraction philosophy
The elementwise function represents a paradigm shift from traditional GPU programming:
Traditional CUDA/HIP approach:
// Manual thread management
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
    out[idx] = a[idx] + b[idx];  // Scalar operation
}
Mojo functional approach:
# Automatic management + SIMD vectorization
elementwise[add_function, simd_width, target="gpu"](size, ctx)
What elementwise abstracts away:
- Thread grid configuration: No need to calculate block/grid dimensions
- Bounds checking: Automatic handling of array boundaries
- Memory coalescing: Optimal memory access patterns built-in
- SIMD orchestration: Vectorization handled transparently
- GPU target selection: Works across different GPU architectures
2. Deep dive: nested function architecture
@parameter
@always_inline
fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
Parameter Analysis:
- @parameter: This decorator enables compile-time specialization. The function is generated separately for each unique simd_width and rank, allowing aggressive optimization.
- @always_inline: Critical for GPU performance - eliminates function call overhead by embedding the code directly into the kernel.
- capturing: Enables lexical scoping - the inner function can access variables from the outer scope without explicit parameter passing.
- IndexList[rank]: Provides dimension-agnostic indexing - the same pattern works for 1D vectors, 2D matrices, 3D tensors, etc.
3. SIMD execution model deep dive
idx = indices[0]                     # Linear index: 0, 4, 8, 12, ... (GPU-dependent spacing)
a_simd = a.load[simd_width](idx, 0)  # Load: [a[0:4], a[4:8], a[8:12], ...] (4 elements per load)
b_simd = b.load[simd_width](idx, 0)  # Load: [b[0:4], b[4:8], b[8:12], ...] (4 elements per load)
ret = a_simd + b_simd                # SIMD: 4 additions in parallel (GPU-dependent)
out.store[simd_width](idx, 0, ret)   # Store: 4 results simultaneously (GPU-dependent)
Execution Hierarchy Visualization:
GPU Architecture:
├── Grid (entire problem)
│ ├── Block 1 (multiple warps)
│ │ ├── Warp 1 (32 threads) --> We'll learn about warps in Part VI
│ │ │ ├── Thread 1 → SIMD[4 elements] ← Our focus (GPU-dependent width)
│ │ │ ├── Thread 2 → SIMD[4 elements]
│ │ │ └── ...
│ │ └── Warp 2 (32 threads)
│ └── Block 2 (multiple warps)
For a 1024-element vector with SIMD_WIDTH=4 (example GPU):
- Total SIMD operations needed: 1024 ÷ 4 = 256
- GPU launches: 256 threads (1024 ÷ 4)
- Each thread processes: Exactly 4 consecutive elements
- Memory bandwidth: SIMD_WIDTH× improvement over scalar operations
Note: SIMD width varies by GPU architecture (e.g., 4 for some GPUs, 8 for RTX 4090, 16 for A100).
4. Memory access pattern analysis
a.load[simd_width](idx, 0)  # Coalesced memory access
Memory Coalescing Benefits:
- Sequential access: Threads access consecutive memory locations
- Cache optimization: Maximizes L1/L2 cache hit rates
- Bandwidth utilization: Achieves near-theoretical memory bandwidth
- Hardware efficiency: GPU memory controllers optimized for this pattern
Example for SIMD_WIDTH=4 (GPU-dependent):
Thread 0: loads a[0:4] → Memory bank 0-3
Thread 1: loads a[4:8] → Memory bank 4-7
Thread 2: loads a[8:12] → Memory bank 8-11
...
Result: Optimal memory controller utilization
5. Performance characteristics & optimization
Computational Intensity Analysis (for SIMD_WIDTH=4):
- Arithmetic operations: 1 SIMD addition per 4 elements
- Memory operations: 2 SIMD loads + 1 SIMD store per 4 elements
- Arithmetic intensity: 1 add ÷ 3 memory ops = 0.33 (memory-bound)
Why This Is Memory-Bound:
For a simple addition, the GPU can compute far faster than it can move data: memory bandwidth, not compute capability, is the bottleneck.
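Expressed in the more familiar FLOP-per-byte form (a rough estimate assuming 4-byte float32 elements and no cache reuse), the same counting gives:
\[\Large \text{arithmetic intensity} \approx \frac{1 \text{ add}}{(2 \text{ loads} + 1 \text{ store}) \times 4 \text{ bytes}} = \frac{1}{12} \approx 0.08 \ \text{FLOP/byte}\]
Either way of counting leads to the same conclusion: performance is limited by data movement, not arithmetic.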
Optimization Implications:
- Focus on memory access patterns rather than arithmetic optimization
- SIMD vectorization provides the primary performance benefit
- Memory coalescing is critical for performance
- Cache locality matters more than computational complexity
6. Scaling and adaptability
Automatic Hardware Adaptation:
alias SIMD_WIDTH = simdwidthof[dtype, target = _get_gpu_target()]()
- GPU-specific optimization: SIMD width adapts to hardware (e.g., 4 for some cards, 8 for RTX 4090, 16 for A100)
- Data type awareness: Different SIMD widths for float32 vs float16 (see the sketch after this list)
- Compile-time optimization: Zero runtime overhead for hardware detection
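To see the data-type dependence on your own machine, the short sketch below queries simdwidthof for two element types. It uses the host target for simplicity; the import path and the printed widths are assumptions that vary by Mojo version and hardware, and the GPU-targeted form is the one shown above.
from sys import simdwidthof  # assumed import path; the puzzle additionally passes a GPU target

fn main():
    # Host-side widths; the GPU query in p20.mojo adds target = _get_gpu_target()
    print("float32 SIMD width:", simdwidthof[DType.float32]())
    print("float16 SIMD width:", simdwidthof[DType.float16]())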
Scalability Properties:
- Thread count: Automatically scales with problem size
- Memory usage: Linear scaling with input size
- Performance: Near-linear speedup until memory bandwidth saturation
7. Advanced insights: why this pattern matters
Foundation for Complex Operations: This elementwise pattern is the building block for:
- Reduction operations: Sum, max, min across large arrays
- Broadcast operations: Scalar-to-vector operations
- Complex transformations: Activation functions, normalization
- Multi-dimensional operations: Matrix operations, convolutions
Compared to Traditional Approaches:
// Traditional: Error-prone, verbose, hardware-specific
__global__ void add_kernel(float* out, float* a, float* b, int size) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
out[idx] = a[idx] + b[idx]; // No vectorization
}
}
# Mojo: Safe, concise, automatically vectorized
elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)
Benefits of Functional Approach:
- Safety: Automatic bounds checking prevents buffer overruns
- Portability: Same code works across GPU vendors/generations
- Performance: Compiler optimizations often exceed hand-tuned code
- Maintainability: Clean abstractions reduce debugging complexity
- Composability: Easy to combine with other functional operations
This pattern represents the future of GPU programming - high-level abstractions that don’t sacrifice performance, making GPU computing accessible while maintaining optimal efficiency.
Next Steps
Once you’ve mastered elementwise operations, you’re ready for:
- ⚡ Tile Operations: Memory-efficient tiled processing patterns
- 🔧 Vectorization: Fine-grained SIMD control
- 🧠 GPU Threading vs SIMD: Understanding the execution hierarchy
- 📊 Benchmarking: Performance analysis and optimization
💡 Key Takeaway: The elementwise pattern demonstrates how Mojo combines functional programming elegance with GPU performance, automatically handling vectorization and thread management while maintaining full control over the computation.