Warp Lanes & SIMT Execution
Mental model for warp programming vs SIMD
What is a warp?
A warp is a group of 32 (or 64) GPU threads that execute the same instruction at the same time on different data. Think of it as a synchronized vector unit where each thread acts like a "lane" in a vector processor.
Simple example:
from gpu.warp import sum
# All 32 threads in the warp execute this simultaneously:
var my_value = input[my_thread_id] # Each gets different data
var warp_total = sum(my_value) # All contribute to one sum
What just happened? Instead of 32 separate threads doing complex coordination, the warp automatically synchronized them to produce a single result. This is SIMT (Single Instruction, Multiple Thread) execution.
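To make this concrete, here is roughly what the same idea looks like as a complete kernel. The function name, the raw-pointer signature, and the buffer names `input`/`output` are illustrative assumptions rather than a fixed API; the warp pieces (`sum`, `lane_id`) are the ones this section introduces.

```mojo
from gpu import thread_idx, block_idx, block_dim, lane_id
from gpu.warp import sum
from memory import UnsafePointer


# Hypothetical kernel: each warp-sized block reduces its slice of `input`.
fn warp_sum_kernel(output: UnsafePointer[Float32], input: UnsafePointer[Float32]):
    var global_i = block_idx.x * block_dim.x + thread_idx.x
    var my_value = input[global_i]   # each lane loads a different element
    var warp_total = sum(my_value)   # all lanes cooperate on one sum
    if lane_id() == 0:               # one lane publishes the warp's result
        output[block_idx.x] = warp_total
```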
SIMT vs SIMD comparison
If you're familiar with CPU vector programming (SIMD), GPU warps are similar but with key differences:
Aspect | CPU SIMD (e.g., AVX) | GPU Warp (SIMT) |
---|---|---|
Programming model | Explicit vector operations | Thread-based programming |
Data width | Fixed (256/512 bits) | Fixed per architecture (32 or 64 threads) |
Synchronization | Implicit within instruction | Implicit within warp |
Communication | Via memory/registers | Via shuffle operations |
Divergence handling | Not applicable | Hardware masking |
Example | a + b | sum(thread_value) |
CPU SIMD approach (C++ intrinsics):
// Explicit vector operations - say 8 floats in parallel
__m256 result = _mm256_add_ps(a, b); // Add 8 pairs simultaneously
CPU SIMD approach (Mojo):
# SIMD is a first-class type in Mojo, so if a and b are SIMD values,
# this adds 8 floats in parallel
var result = a + b # Add 8 pairs simultaneously
GPU SIMT approach (Mojo):
# Thread-based code that becomes vector operations
from gpu.warp import sum
var my_data = input[thread_id] # Each thread gets its element
var partial = my_data * coefficient # All threads compute simultaneously
var total = sum(partial) # Hardware coordinates the sum
Core concepts that make warps powerful
1. Lane identity: Each thread has a "lane ID" (0 to WARP_SIZE - 1) that's essentially free to access
var my_lane = lane_id() # Just reading a hardware register
2. Implicit synchronization: No barriers needed within a warp
# This just works - all threads automatically synchronized
var total = sum(my_contribution)
3. Efficient communication: Threads can share data without memory
# Get value from lane 0 to all other lanes
var broadcasted = shuffle_idx(my_value, 0)
Key insight: SIMT lets you write natural thread code that executes as efficient vector operations, combining the ease of thread programming with the performance of vector processing.
Where warps fit in GPU execution hierarchy
For complete context on how warps relate to the overall GPU execution model, see GPU Threading vs SIMD. Here's where warps fit:
GPU Device
├── Grid (your entire problem)
│   ├── Block 1 (group of threads, shared memory)
│   │   ├── Warp 1 (32 threads, lockstep execution) ← This level
│   │   │   ├── Thread 1 → SIMD operations
│   │   │   ├── Thread 2 → SIMD operations
│   │   │   └── ... (32 threads total)
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (independent group)
Warp programming operates at the "Warp level" - you work with operations that coordinate all 32 threads within a single warp, enabling powerful primitives like sum() that would otherwise require complex shared memory coordination.
This mental model helps you recognize when problems map naturally to warp operations versus requiring traditional shared memory approaches.
The hardware foundation of warp programming
Understanding Single Instruction, Multiple Thread (SIMT) execution is crucial for effective warp programming. This isn't just a software abstraction - it's how GPU hardware actually works at the silicon level.
What is SIMT execution?
SIMT means that within a warp, all threads execute the same instruction at the same time on different data. This is fundamentally different from CPU threads, which can execute completely different instructions independently.
CPU vs GPU Execution Models
Aspect | CPU (MIMD) | GPU Warp (SIMT) |
---|---|---|
Instruction Model | Multiple Instructions, Multiple Data | Single Instruction, Multiple Thread |
Core 1 | add r1, r2 | add r1, r2 |
Core 2 | load r3, [mem] | add r1, r2 (same instruction) |
Core 3 | branch loop | add r1, r2 (same instruction) |
… Core 32 | different instruction | add r1, r2 (same instruction) |
Execution | Independent, asynchronous | Synchronized, lockstep |
Scheduling | Complex, OS-managed | Simple, hardware-managed |
Data | Independent data sets | Different data, same operation |
GPU Warp Execution Pattern:
- Instruction: Same for all 32 lanes: add r1, r2
- Lane 0: Operates on Data0 → Result0
- Lane 1: Operates on Data1 → Result1
- Lane 2: Operates on Data2 → Result2
- … (all lanes execute simultaneously)
- Lane 31: Operates on Data31 → Result31
Key insight: All lanes execute the same instruction at the same time on different data.
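In thread-style Mojo code, this "same instruction, different data" picture is just an ordinary line that every lane runs at the same instant; only the index differs per lane. A one-line fragment, with the buffer name `data` as an illustrative placeholder:

```mojo
from gpu import lane_id

# Every lane executes this same multiply simultaneously;
# lane_id() is the only thing that differs, so each lane gets its own element.
var my_result = data[lane_id()] * 2.0
```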
Why SIMT works for GPUs
GPUs are optimized for throughput, not latency. SIMT enables:
- Hardware simplification: One instruction decoder serves 32 or 64 threads
- Execution efficiency: No complex scheduling between warp threads
- Memory bandwidth: Coalesced memory access patterns
- Power efficiency: Shared control logic across lanes
Warp execution mechanics
Lane numbering and identity
Each thread within a warp has a lane ID from 0 to WARP_SIZE - 1:
from gpu import lane_id
from gpu.warp import WARP_SIZE
# Within a kernel function:
var my_lane = lane_id() # Returns 0-31 (NVIDIA/RDNA) or 0-63 (CDNA)
Key insight: lane_id() is free - it's just reading a hardware register, not computing a value.
Synchronization within warps
The most powerful aspect of SIMT: implicit synchronization.
# Traditional shared memory approach:
shared[local_i] = partial_result
barrier() # Explicit synchronization required
var sum = shared[0] + shared[1] + ... # Complex reduction
# Warp approach:
from gpu.warp import sum
var total = sum(partial_result) # Implicit synchronization!
Why no barriers needed? All lanes execute each instruction at exactly the same time. When sum() starts, all lanes have already computed their partial_result.
Warp divergence and convergence
What happens with conditional code?
if lane_id() % 2 == 0:
    # Even lanes execute this path
    result = compute_even()
else:
    # Odd lanes execute this path
    result = compute_odd()
# All lanes converge here
Hardware behavior steps:
Step | Phase | Active Lanes | Waiting Lanes | Efficiency | Performance Cost |
---|---|---|---|---|---|
1 | Condition evaluation | All 32 lanes | None | 100% | Normal speed |
2 | Even lanes branch | Lanes 0,2,4…30 (16 lanes) | Lanes 1,3,5…31 (16 lanes) | 50% | 2× slower |
3 | Odd lanes branch | Lanes 1,3,5…31 (16 lanes) | Lanes 0,2,4…30 (16 lanes) | 50% | 2× slower |
4 | Convergence | All 32 lanes | None | 100% | Normal speed resumed |
Example breakdown:
- Step 2: Only even lanes execute compute_even() while odd lanes wait
- Step 3: Only odd lanes execute compute_odd() while even lanes wait
- Total time: time(compute_even) + time(compute_odd) (sequential execution)
- Without divergence: max(time(compute_even), time(compute_odd)) (parallel execution)
Performance impact:
- Divergence: Warp splits execution - some lanes active, others wait
- Serial execution: Different paths run sequentially, not in parallel
- Convergence: All lanes reunite and continue together
- Cost: Divergent warps take 2× the time (or more) vs unified execution (a branchless rewrite of the example above is sketched below)
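When both branches are cheap, a common mitigation is to let every lane compute both values and then select between them, so the warp never splits. This is only a sketch that reuses the placeholder functions from the example above, and it only pays off when the redundant work costs less than serialized divergence:

```mojo
# Both paths run on all lanes, then each lane keeps the value it needs.
# No divergence: the warp stays in lockstep the whole time.
var even_val = compute_even()
var odd_val = compute_odd()
var result = even_val if lane_id() % 2 == 0 else odd_val
```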
Best practices for warp efficiency
Warp efficiency patterns
✅ EXCELLENT: Uniform execution (100% efficiency)
# All lanes do the same work - no divergence
var partial = a[global_i] * b[global_i]
var total = sum(partial)
Performance: All 32 lanes active simultaneously
⚠️ ACCEPTABLE: Predictable divergence (~95% efficiency)
# Divergence based on lane_id() - hardware optimized
var total = sum(partial)  # All lanes participate in the reduction
if lane_id() == 0:
    output[block_idx.x] = total  # Only this single store diverges
Performance: Brief single-lane operation, predictable pattern
🔶 CAUTION: Structured divergence (~50-75% efficiency)
# Regular patterns can be optimized by compiler
if (global_i // 4) % 2 == 0:
    result = method_a()
else:
    result = method_b()
Performance: Predictable groups, some optimization possible
❌ AVOID: Data-dependent divergence (~25-50% efficiency)
# Different lanes may take different paths based on data
if input[global_i] > threshold: # Unpredictable branching
    result = expensive_computation()
else:
    result = simple_computation()
Performance: Random divergence kills warp efficiency
💀 TERRIBLE: Nested data-dependent divergence (~10-25% efficiency)
# Multiple levels of unpredictable branching
if input[global_i] > threshold1:
    if input[global_i] > threshold2:
        result = very_expensive()
    else:
        result = expensive()
else:
    result = simple()
Performance: Warp efficiency destroyed
Cross-architecture compatibility
NVIDIA vs AMD warp sizes
from gpu.warp import WARP_SIZE
# NVIDIA GPUs: WARP_SIZE = 32
# AMD RDNA GPUs: WARP_SIZE = 32 (wavefront32 mode)
# AMD CDNA GPUs: WARP_SIZE = 64 (traditional wavefront64)
Why this matters:
- Memory patterns: Coalesced access depends on warp size
- Algorithm design: Reduction trees must account for warp size
- Performance scaling: Twice as many lanes per warp on AMD CDNA
Writing portable warp code
Architecture Adaptation Strategies
✅ PORTABLE: Always use WARP_SIZE
alias THREADS_PER_BLOCK = (WARP_SIZE, 1) # Adapts automatically
alias ELEMENTS_PER_WARP = WARP_SIZE # Scales with hardware
Result: Code works optimally on NVIDIA and AMD RDNA (32 lanes) and on AMD CDNA (64 lanes)
❌ BROKEN: Never hardcode warp size
alias THREADS_PER_BLOCK = (32, 1) # Breaks on AMD CDNA GPUs!
alias REDUCTION_SIZE = 32 # Wrong where WARP_SIZE is 64!
Result: Suboptimal or incorrect on AMD CDNA GPUs (WARP_SIZE = 64)
Real Hardware Impact
GPU Architecture | WARP_SIZE | Memory per Warp | Reduction Steps | Lane Pattern |
---|---|---|---|---|
NVIDIA/AMD RDNA | 32 | 128 bytes (4×32) | 5 steps: 32→16→8→4→2→1 | Lanes 0-31 |
AMD CDNA | 64 | 256 bytes (4×64) | 6 steps: 64→32→16→8→4→2→1 | Lanes 0-63 |
Performance implications of 64 vs 32:
- CDNA advantage: 2× memory bandwidth per warp
- CDNA advantage: 2× computation per warp
- NVIDIA/RDNA advantage: More warps per block (better occupancy)
- Code portability: Same source, optimal performance on both (see the sketch below)
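One small, portable habit that follows from this: derive launch-related constants from WARP_SIZE instead of a literal 32. A minimal sketch; the 256-thread block size is an arbitrary example value, not a recommendation:

```mojo
from gpu.warp import WARP_SIZE

alias THREADS_PER_BLOCK = 256
# 8 warps per block on 32-lane GPUs, 4 warps per block on 64-lane CDNA GPUs
alias WARPS_PER_BLOCK = THREADS_PER_BLOCK // WARP_SIZE
# One partial result per warp, regardless of the lane count
alias PARTIALS_PER_BLOCK = WARPS_PER_BLOCK
```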
Memory access patterns with warps
Coalesced Memory Access Patterns
✅ PERFECT: Coalesced access (100% bandwidth utilization)
# Adjacent lanes → adjacent memory addresses
var value = input[global_i] # Lane 0→input[0], Lane 1→input[1], etc.
Memory access patterns:
Access Pattern | NVIDIA/RDNA (32 lanes) | CDNA (64 lanes) | Bandwidth Utilization | Performance |
---|---|---|---|---|
✅ Coalesced | Lane N → Address 4×N (1 transaction: 128 bytes) | Lane N → Address 4×N (1 transaction: 256 bytes) | 100% (full bus width) | Optimal |
❌ Scattered | Lane N → Random address (32 separate transactions) | Lane N → Random address (64 separate transactions) | ~6% (mostly idle bus) | ~32× slower |
Example addresses:
- Coalesced: Lane 0→0, Lane 1→4, Lane 2→8, Lane 3→12, …
- Scattered: Lane 0→1000, Lane 1→52, Lane 2→997, Lane 3→8, … (both patterns are sketched as kernel code below)
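Here are the two patterns as they would appear inside a kernel. A sketch only: the function name, buffer names, and the stride of 32 are illustrative; the point is how addresses relate across adjacent lanes.

```mojo
from gpu import thread_idx, block_idx, block_dim
from memory import UnsafePointer


# Hypothetical kernel contrasting the two access patterns.
fn access_pattern_demo(output: UnsafePointer[Float32], input: UnsafePointer[Float32]):
    var global_i = block_idx.x * block_dim.x + thread_idx.x

    # Coalesced: adjacent lanes read adjacent float32 elements, so the
    # warp's loads merge into a few wide memory transactions.
    var coalesced = input[global_i]

    # Strided: adjacent lanes are 32 elements apart, so one warp touches
    # many different cache lines and its loads split into many transactions.
    var strided = input[global_i * 32]

    output[global_i] = coalesced + strided
```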
Shared memory bank conflicts
What is a bank conflict?
GPU shared memory is typically divided into 32 independent banks that can be accessed simultaneously. A bank conflict occurs when multiple threads in a warp try to access different addresses within the same bank at the same time. When this happens, the hardware must serialize these accesses, turning what should be a single-cycle operation into multiple cycles.
Key concepts:
- No conflict: Each thread accesses a different bank → All accesses happen simultaneously (1 cycle)
- Bank conflict: Multiple threads access the same bank → Accesses happen sequentially (N cycles for N threads)
- Broadcast: All threads access the same address → Hardware optimizes this to 1 cycle
Shared memory bank organization:
Bank | Addresses (byte offsets) | Example Data (float32) |
---|---|---|
Bank 0 | 0, 128, 256, 384, … | shared[0], shared[32], shared[64], … |
Bank 1 | 4, 132, 260, 388, … | shared[1], shared[33], shared[65], … |
Bank 2 | 8, 136, 264, 392, … | shared[2], shared[34], shared[66], … |
… | … | … |
Bank 31 | 124, 252, 380, 508, … | shared[31], shared[63], shared[95], … |
Bank conflict examples (also sketched in code after the table):
Access Pattern | Bank Usage | Cycles | Performance | Explanation |
---|---|---|---|---|
✅ Sequential | shared[thread_idx.x] (Lane 0→Bank 0, Lane 1→Bank 1, …) | 1 cycle | 100% (optimal) | Each lane hits a different bank, no conflicts |
❌ Stride 2 | shared[thread_idx.x * 2] (Lanes 0,16→Bank 0; Lanes 1,17→Bank 1) | 2 cycles | 50% (2× slower) | 2 lanes per bank, serialized access |
💀 Same index | shared[0] (all 32 lanes→Bank 0) | 32 cycles | ~3% (32× slower) | All lanes hit Bank 0, completely serialized |
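The three table rows, written as indexing sketches. This is a kernel-body fragment under a stated assumption: `shared` stands for a float32 buffer in shared memory whose allocation is omitted here.

```mojo
# Conflict-free: lane k hits bank k, so the whole warp reads in one cycle.
var no_conflict = shared[thread_idx.x]

# 2-way conflict: lanes 0 and 16 both hit bank 0, lanes 1 and 17 both hit
# bank 1, and so on - the access serializes into two cycles.
var two_way = shared[thread_idx.x * 2]

# Broadcast: every lane reads the same address; hardware resolves this in
# one cycle instead of a 32-way conflict.
var broadcast_read = shared[0]
```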
Practical implications for warp programming
When warp operations are most effective
- Reduction operations:
sum()
,max()
, etc. - Broadcast operations:
shuffle_idx()
to share values - Neighbor communication:
shuffle_down()
for sliding windows - Prefix computations:
prefix_sum()
for scan algorithms
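Side by side, those calls look roughly like the sketch below. The helper function and `my_value` are illustrative assumptions; check the gpu.warp module docs for exact signatures (for example, whether prefix_sum is inclusive or exclusive).

```mojo
from gpu.warp import max, prefix_sum, shuffle_down, shuffle_idx, sum


# Hypothetical device-side helper: `my_value` is whatever each lane loaded.
fn warp_primitive_tour(my_value: Float32) -> Float32:
    var warp_total = sum(my_value)             # reduction: one total for the warp
    var warp_peak = max(my_value)              # reduction: warp-wide maximum
    var from_lane0 = shuffle_idx(my_value, 0)  # broadcast: everyone gets lane 0's value
    var neighbor = shuffle_down(my_value, 1)   # neighbor: value held by lane_id() + 1
    var running = prefix_sum(my_value)         # scan: running total across lanes
    # Combine arbitrarily just so every value is used; real code would pick one.
    return warp_total + warp_peak + from_lane0 + neighbor + running
```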
Performance characteristics
Operation Type | Traditional | Warp Operations |
---|---|---|
Reduction (32 elements) | ~10 instructions | 1 instruction |
Memory traffic | High | Minimal |
Synchronization cost | Expensive | Free |
Code complexity | High | Low |
Next steps
Now that you understand the SIMT foundation, you're ready to see how these concepts enable powerful warp operations. The next section will show you how sum() transforms complex reduction patterns into simple, efficient function calls.
→ Continue to warp.sum() Essentials