🧠 Warp Lanes & SIMT Execution

Mental model for warp programming vs SIMD

What is a warp?

A warp is a group of 32 GPU threads (64 on some AMD GPUs) that execute the same instruction at the same time on different data. Think of it as a synchronized vector unit where each thread acts like a "lane" in a vector processor.

Simple example:

from gpu import thread_idx
from gpu.warp import sum

# All 32 threads in the warp execute this simultaneously:
var my_value = input[thread_idx.x]   # Each thread gets different data
var warp_total = sum(my_value)       # All threads contribute to one sum

What just happened? Instead of 32 separate threads doing complex coordination, the warp automatically synchronized them to produce a single result. This is SIMT (Single Instruction, Multiple Thread) execution.
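
To make this concrete, here is a minimal kernel sketch assuming one warp per block; the kernel name, pointer parameters, and size handling are illustrative assumptions, not a fixed API from this book:

from gpu import thread_idx, block_idx, block_dim
from gpu.warp import sum
from memory import UnsafePointer

alias dtype = DType.float32

# Illustrative kernel: each lane loads one element, the warp reduces them.
fn single_warp_sum(
    output: UnsafePointer[Scalar[dtype]],
    input: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var my_value = Scalar[dtype](0)
    if global_i < size:
        my_value = input[global_i]        # Each lane reads different data
    var warp_total = sum(my_value)        # All lanes contribute to one sum
    if thread_idx.x == 0:
        output[block_idx.x] = warp_total  # With one warp per block, thread 0 is lane 0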

SIMT vs SIMD comparison

If you're familiar with CPU vector programming (SIMD), GPU warps are similar but with key differences:

| Aspect | CPU SIMD (e.g., AVX) | GPU Warp (SIMT) |
|---|---|---|
| Programming model | Explicit vector operations | Thread-based programming |
| Data width | Fixed (256/512 bits) | Flexible (32/64 threads) |
| Synchronization | Implicit within instruction | Implicit within warp |
| Communication | Via memory/registers | Via shuffle operations |
| Divergence handling | Not applicable | Hardware masking |
| Example | a + b | sum(thread_value) |

CPU SIMD approach (C++ intrinsics):

// Explicit vector operations - say 8 floats in parallel
__m256 result = _mm256_add_ps(a, b);   // Add 8 pairs simultaneously

CPU SIMD approach (Mojo):

# SIMD is a first-class type in Mojo: if a and b are SIMD values
# (e.g. SIMD[DType.float32, 8]), this adds 8 floats in parallel
var result = a + b  # Add 8 pairs simultaneously

GPU SIMT approach (Mojo):

# Thread-based code that becomes vector operations
from gpu import thread_idx
from gpu.warp import sum

var my_data = input[thread_idx.x]      # Each thread gets its element
var partial = my_data * coefficient    # All threads compute simultaneously
var total = sum(partial)               # Hardware coordinates the sum

Core concepts that make warps powerful

2. Lane identity: Each thread has a "lane ID" (0 to 31) that's essentially free to access

var my_lane = lane_id()  # Just reading a hardware register

2. Implicit synchronization: No barriers needed within a warp

# This just works - all threads automatically synchronized
var total = sum(my_contribution)

3. Efficient communication: Threads can share data without memory

# Get value from lane 0 to all other lanes
var broadcasted = shuffle_idx(my_value, 0)

Key insight: SIMT lets you write natural thread code that executes as efficient vector operations, combining the ease of thread programming with the performance of vector processing.
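
As a small, hedged sketch, here is how the three pieces look together inside a kernel body; my_value stands for whatever per-lane value the kernel has already computed:

from gpu import lane_id
from gpu.warp import sum, shuffle_idx

# Inside a kernel, with my_value already computed by each lane:
var my_lane = lane_id()                     # 1. Lane identity: just a register read
var total = sum(my_value)                   # 2. Implicit sync: warp-wide reduction, no barrier
var lane0_value = shuffle_idx(my_value, 0)  # 3. Communication: lane 0's value, no shared memory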

Where warps fit in GPU execution hierarchy

For complete context on how warps relate to the overall GPU execution model, see GPU Threading vs SIMD. Here's where warps fit:

GPU Device
├── Grid (your entire problem)
│   ├── Block 1 (group of threads, shared memory)
│   │   ├── Warp 1 (32 threads, lockstep execution) ← This level
│   │   │   ├── Thread 1 → SIMD operations
│   │   │   ├── Thread 2 → SIMD operations
│   │   │   └── ... (32 threads total)
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (independent group)

Warp programming operates at the "warp level": you work with operations that coordinate all 32 threads within a single warp, enabling powerful primitives like sum() that would otherwise require complex shared memory coordination.

This mental model helps you recognize when problems map naturally to warp operations versus requiring traditional shared memory approaches.

The hardware foundation of warp programming

Understanding Single Instruction, Multiple Thread (SIMT) execution is crucial for effective warp programming. This isn't just a software abstraction: it's how GPU hardware actually works at the silicon level.

What is SIMT execution?

SIMT means that within a warp, all threads execute the same instruction at the same time on different data. This is fundamentally different from CPU threads, which can execute completely different instructions independently.

CPU vs GPU Execution Models

| Aspect | CPU (MIMD) | GPU Warp (SIMT) |
|---|---|---|
| Instruction model | Multiple Instructions, Multiple Data | Single Instruction, Multiple Thread |
| Core 1 | add r1, r2 | add r1, r2 |
| Core 2 | load r3, [mem] | add r1, r2 (same instruction) |
| Core 3 | branch loop | add r1, r2 (same instruction) |
| … Core 32 | different instruction | add r1, r2 (same instruction) |
| Execution | Independent, asynchronous | Synchronized, lockstep |
| Scheduling | Complex, OS-managed | Simple, hardware-managed |
| Data | Independent data sets | Different data, same operation |

GPU Warp Execution Pattern:

  • Instruction: Same for all 32 lanes: add r1, r2
  • Lane 0: Operates on Data0 → Result0
  • Lane 1: Operates on Data1 → Result1
  • Lane 2: Operates on Data2 → Result2
  • … (all lanes execute simultaneously)
  • Lane 31: Operates on Data31 → Result31

Key insight: All lanes execute the same instruction at the same time on different data.

Why SIMT works for GPUs

GPUs are optimized for throughput, not latency. SIMT enables:

  • Hardware simplification: One instruction decoder serves 32 or 64 threads
  • Execution efficiency: No complex scheduling between warp threads
  • Memory bandwidth: Coalesced memory access patterns
  • Power efficiency: Shared control logic across lanes

Warp execution mechanics

Lane numbering and identity

Each thread within a warp has a lane ID from 0 to WARP_SIZE-1:

from gpu import lane_id
from gpu.warp import WARP_SIZE

# Within a kernel function:
var my_lane = lane_id()  # Returns 0-31 (NVIDIA/RDNA) or 0-63 (CDNA)

Key insight: lane_id() is free; it's just reading a hardware register, not computing a value.

Synchronization within warps

The most powerful aspect of SIMT: implicit synchronization.

# Traditional shared memory approach:
shared[local_i] = partial_result
barrier()  # Explicit synchronization required
var total = shared[0] + shared[1] + ...  # Complex manual reduction

# Warp approach:
from gpu.warp import sum

var total = sum(partial_result)  # Implicit synchronization!

Why are no barriers needed? All lanes execute each instruction at exactly the same time. When sum() starts, every lane has already computed its partial_result.
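
To see where barriers do still matter, here is a hedged sketch of a block-level reduction: sum() handles each warp with no barrier, and a single barrier() is only needed to combine results across warps. The TPB alias and the stack_allocation/AddressSpace shared buffer are assumptions for this sketch; the exact allocation imports may differ in your setup.

from gpu import thread_idx, barrier, lane_id
from gpu.memory import AddressSpace
from gpu.warp import sum, WARP_SIZE
from memory import stack_allocation

alias TPB = 128                      # Threads per block (assumed for this sketch)
alias NUM_WARPS = TPB // WARP_SIZE

# Inside a kernel, after each thread has computed partial_result:
var warp_total = sum(partial_result)          # Within the warp: no barrier needed

var warp_sums = stack_allocation[
    NUM_WARPS,
    Scalar[DType.float32],
    address_space = AddressSpace.SHARED,
]()
var warp_id = Int(thread_idx.x) // WARP_SIZE
if lane_id() == 0:
    warp_sums[warp_id] = warp_total           # One value per warp goes to shared memory
barrier()                                     # Across warps: a barrier is still required
# Warp 0 (or thread 0) can now reduce the NUM_WARPS values in warp_sums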

Warp divergence and convergence

What happens with conditional code?

if lane_id() % 2 == 0:
    # Even lanes execute this path
    result = compute_even()
else:
    # Odd lanes execute this path
    result = compute_odd()
# All lanes converge here

Hardware behavior steps:

| Step | Phase | Active Lanes | Waiting Lanes | Efficiency | Performance Cost |
|---|---|---|---|---|---|
| 1 | Condition evaluation | All 32 lanes | None | 100% | Normal speed |
| 2 | Even lanes branch | Lanes 0, 2, 4, …, 30 (16 lanes) | Lanes 1, 3, 5, …, 31 (16 lanes) | 50% | 2× slower |
| 3 | Odd lanes branch | Lanes 1, 3, 5, …, 31 (16 lanes) | Lanes 0, 2, 4, …, 30 (16 lanes) | 50% | 2× slower |
| 4 | Convergence | All 32 lanes | None | 100% | Normal speed resumed |

Example breakdown:

  • Step 2: Only even lanes execute compute_even() while odd lanes wait
  • Step 3: Only odd lanes execute compute_odd() while even lanes wait
  • Total time: time(compute_even) + time(compute_odd) (sequential execution)
  • Without divergence: max(time(compute_even), time(compute_odd)) (if the two paths could run in parallel)

Performance impact:

  1. Divergence: Warp splits execution - some lanes active, others wait
  2. Serial execution: Different paths run sequentially, not in parallel
  3. Convergence: All lanes reunite and continue together
  4. Cost: Divergent warps take 2× time (or more) vs unified execution

Best practices for warp efficiency

Warp efficiency patterns

✅ EXCELLENT: Uniform execution (100% efficiency)

# All lanes do the same work - no divergence
var partial = a[global_i] * b[global_i]
var total = sum(partial)

Performance: All 32 lanes active simultaneously

⚠️ ACCEPTABLE: Predictable divergence (~95% efficiency)

# Divergence based on lane_id() - hardware handles this predictably
var total = sum(partial)      # All lanes participate in the reduction
if lane_id() == 0:
    output[block_idx.x] = total   # Only lane 0 diverges, briefly, to write

Performance: Brief single-lane operation, predictable pattern

🔶 CAUTION: Structured divergence (~50-75% efficiency)

# Regular patterns can be optimized by compiler
if (global_i // 4) % 2 == 0:
    result = method_a()
else:
    result = method_b()

Performance: Predictable groups, some optimization possible

❌ AVOID: Data-dependent divergence (~25-50% efficiency)

# Different lanes may take different paths based on data
if input[global_i] > threshold:  # Unpredictable branching
    result = expensive_computation()
else:
    result = simple_computation()

Performance: Random divergence kills warp efficiency

💀 TERRIBLE: Nested data-dependent divergence (~10-25% efficiency)

# Multiple levels of unpredictable branching
if input[global_i] > threshold1:
    if input[global_i] > threshold2:
        result = very_expensive()
    else:
        result = expensive()
else:
    result = simple()

Performance: Warp efficiency destroyed
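
One common mitigation for the data-dependent cases above, sketched below, is to compute both cheap alternatives and choose with a conditional expression, which the compiler can often lower to a predicated select rather than a branch. This only pays off when both sides are cheap; my_value and threshold are assumed per-lane Float32 values.

# When both alternatives are cheap, compute both and let every lane pick one:
var doubled = my_value * 2.0
var halved = my_value * 0.5
var result = doubled if my_value > threshold else halved  # Often lowers to a select, not a branch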

Cross-architecture compatibility

NVIDIA vs AMD warp sizes

from gpu.warp import WARP_SIZE

# NVIDIA GPUs:     WARP_SIZE = 32
# AMD RDNA GPUs:   WARP_SIZE = 32 (wavefront32 mode)
# AMD CDNA GPUs:   WARP_SIZE = 64 (traditional wavefront64)

Why this matters:

  • Memory patterns: Coalesced access depends on warp size
  • Algorithm design: Reduction trees must account for warp size
  • Performance scaling: Twice as many lanes per warp on AMD

Writing portable warp code

Architecture Adaptation Strategies

✅ PORTABLE: Always use WARP_SIZE

alias THREADS_PER_BLOCK = (WARP_SIZE, 1)  # Adapts automatically
alias ELEMENTS_PER_WARP = WARP_SIZE        # Scales with hardware

Result: Code works optimally on NVIDIA and AMD RDNA (32 lanes) and on AMD CDNA (64 lanes)
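
For example, a warp-size-agnostic launch configuration might look like the following sketch; SIZE and the tuple-style aliases mirror the pattern above and are illustrative:

from gpu.warp import WARP_SIZE

alias SIZE = 1024
alias THREADS_PER_BLOCK = (WARP_SIZE, 1)                          # One warp per block, any vendor
alias BLOCKS_PER_GRID = ((SIZE + WARP_SIZE - 1) // WARP_SIZE, 1)  # Ceiling division over warp size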

❌ BROKEN: Never hardcode warp size

alias THREADS_PER_BLOCK = (32, 1)  # Wrong assumption on AMD CDNA GPUs (WARP_SIZE = 64)!
alias REDUCTION_SIZE = 32          # Breaks reductions when the warp is 64 lanes wide!

Result: Suboptimal on AMD, potential correctness issues

Real Hardware Impact

| GPU Architecture | WARP_SIZE | Memory per Warp | Reduction Steps | Lane Pattern |
|---|---|---|---|---|
| NVIDIA / AMD RDNA | 32 | 128 bytes (4×32) | 5 steps: 32→16→8→4→2→1 | Lanes 0-31 |
| AMD CDNA | 64 | 256 bytes (4×64) | 6 steps: 64→32→16→8→4→2→1 | Lanes 0-63 |

Performance implications of 64 vs 32:

  • CDNA advantage: 2× memory bandwidth per warp
  • CDNA advantage: 2× computation per warp
  • NVIDIA/RDNA advantage: More warps per block (better occupancy)
  • Code portability: Same source, optimal performance on both
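
The differing step counts come from the log2(WARP_SIZE) halving pattern. As a hedged sketch, here is the classic shuffle-based tree reduction that adapts to either warp size; in practice sum() already does this for you, my_value is the assumed per-lane Float32 input, and the exact offset argument type of shuffle_down may differ:

from gpu.warp import shuffle_down, WARP_SIZE

# Inside a kernel: halving tree reduction over the warp.
var val = my_value                    # Per-lane starting value (assumed Float32)
var offset = WARP_SIZE // 2           # 16 on 32-lane warps, 32 on 64-lane warps
while offset > 0:
    val += shuffle_down(val, UInt32(offset))   # Add the value held by lane_id + offset
    offset //= 2
# Only lane 0 is guaranteed to hold the total for the whole warp here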

Memory access patterns with warps

Coalesced Memory Access Patterns

✅ PERFECT: Coalesced access (100% bandwidth utilization)

# Adjacent lanes → adjacent memory addresses
var value = input[global_i]  # Lane 0→input[0], Lane 1→input[1], etc.

Memory access patterns:

| Access Pattern | NVIDIA/RDNA (32 lanes) | CDNA (64 lanes) | Bandwidth Utilization | Performance |
|---|---|---|---|---|
| ✅ Coalesced | Lane N → Address 4×N (1 transaction, 128 bytes) | Lane N → Address 4×N (1 transaction, 256 bytes) | 100%, full bus width | Optimal |
| ❌ Scattered | Lane N → Random address (32 separate transactions) | Lane N → Random address (64 separate transactions) | ~6%, mostly idle bus | 32× slower |

Example addresses:

  • Coalesced: Lane 0→0, Lane 1→4, Lane 2→8, Lane 3→12, …
  • Scattered: Lane 0→1000, Lane 1→52, Lane 2→997, Lane 3→8, …
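
A hedged sketch of both patterns in kernel code; input is assumed to be a float32 buffer large enough for the strided indices, and the stride of 32 is just one example of a non-coalesced pattern:

from gpu import thread_idx, block_idx, block_dim

var global_i = block_dim.x * block_idx.x + thread_idx.x

# Coalesced: adjacent lanes read adjacent float32 elements -> one wide transaction
var coalesced = input[global_i]

# Strided (illustrative stride of 32 elements): adjacent lanes land 128 bytes apart,
# so the hardware needs many separate transactions
var strided = input[global_i * 32]   # Assumes input is large enough for these indices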

Shared memory bank conflicts

What is a bank conflict?

GPU shared memory is divided into 32 independent banks (on most current hardware) that can be accessed simultaneously. A bank conflict occurs when multiple threads in a warp try to access different addresses within the same bank at the same time. When this happens, the hardware must serialize these accesses, turning what should be a single-cycle operation into multiple cycles.

Key concepts:

  • No conflict: Each thread accesses a different bank → all accesses happen simultaneously (1 cycle)
  • Bank conflict: Multiple threads access the same bank → accesses happen sequentially (N cycles for N threads)
  • Broadcast: All threads access the same address → hardware optimizes this to 1 cycle

Shared memory bank organization:

| Bank | Addresses (byte offsets) | Example Data (float32) |
|---|---|---|
| Bank 0 | 0, 128, 256, 384, … | shared[0], shared[32], shared[64], … |
| Bank 1 | 4, 132, 260, 388, … | shared[1], shared[33], shared[65], … |
| Bank 2 | 8, 136, 264, 392, … | shared[2], shared[34], shared[66], … |
| … | … | … |
| Bank 31 | 124, 252, 380, 508, … | shared[31], shared[63], shared[95], … |

Bank conflict examples:

| Access Pattern | Bank Usage | Cycles | Performance | Explanation |
|---|---|---|---|---|
| ✅ Sequential: shared[thread_idx.x] | Lane 0→Bank 0, Lane 1→Bank 1, … | 1 cycle | 100%, optimal | Each lane hits a different bank; no conflicts |
| ❌ Stride 2: shared[thread_idx.x * 2] | Lanes 0, 16→Bank 0; Lanes 1, 17→Bank 2; … | 2 cycles | 50%, 2× slower | 2 lanes per bank; accesses serialized |
| 💀 Same bank: shared[thread_idx.x * 32] | All 32 lanes→Bank 0 (different addresses) | 32 cycles | 3%, 32× slower | Every lane hits Bank 0; completely serialized (all lanes reading the exact same address is instead a 1-cycle broadcast) |
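
The cycle counts above follow directly from the bank index arithmetic. This small CPU-side sketch (plain Mojo, no GPU needed) checks which bank a float32 element of shared maps to, assuming the 32-bank, 4-byte-word layout from the table:

fn bank_of(index: Int) -> Int:
    # With 4-byte words and 32 banks, element i of a float32 array lands in bank i % 32
    return index % 32

def main():
    print(bank_of(0), bank_of(1), bank_of(31))   # 0 1 31 -> sequential access, no conflict
    print(bank_of(0 * 2), bank_of(16 * 2))       # 0 0    -> stride 2: lanes 0 and 16 collide
    print(bank_of(0 * 32), bank_of(1 * 32))      # 0 0    -> stride 32: every lane hits bank 0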

Practical implications for warp programming

When warp operations are most effective

  1. Reduction operations: sum(), max(), etc.
  2. Broadcast operations: shuffle_idx() to share values
  3. Neighbor communication: shuffle_down() for sliding windows
  4. Prefix computations: prefix_sum() for scan algorithms
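
For patterns 3 and 4, a hedged sketch of what the calls look like inside a kernel; my_value is the assumed per-lane input, and the comments assume the default inclusive scan for prefix_sum:

from gpu.warp import shuffle_down, prefix_sum

# 3. Neighbor communication: read the value held by lane_id + 1
var right_neighbor = shuffle_down(my_value, 1)   # Undefined for the last lane in the warp
var pair_sum = my_value + right_neighbor

# 4. Prefix computation: running total across the warp
var running_total = prefix_sum(my_value)         # Lane i gets the sum of lanes 0..i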

Performance characteristics

| Aspect | Traditional Approach | Warp Operations |
|---|---|---|
| Reduction (32 elements) | ~10 instructions | 1 instruction |
| Memory traffic | High | Minimal |
| Synchronization cost | Expensive | Free |
| Code complexity | High | Low |

Next steps

Now that you understand the SIMT foundation, you're ready to see how these concepts enable powerful warp operations. The next section will show you how sum() transforms complex reduction patterns into simple, efficient function calls.

→ Continue to warp.sum() Essentials