🧠 GPU Threading vs SIMD - Understanding the Execution Hierarchy

Overview

After exploring elementwise, tiled, and vectorization patterns, you’ve seen different ways to organize GPU computation. This section clarifies the fundamental relationship between GPU threads and SIMD operations - two distinct but complementary levels of parallelism that work together for optimal performance.

Key insight: GPU threads provide the parallelism structure, while SIMD operations provide the vectorization within each thread.

Core concepts

GPU threading hierarchy

GPU execution follows a well-defined hierarchy that abstracts hardware complexity:

GPU Device
├── Grid (your entire problem)
│   ├── Block 1 (group of threads, shared memory)
│   │   ├── Warp 1 (32 threads, lockstep execution)
│   │   │   ├── Thread 1 → SIMD operations
│   │   │   ├── Thread 2 → SIMD operations
│   │   │   └── ... (32 threads total)
│   │   └── Warp 2 (32 threads)
│   └── Block 2 (independent group)

💡 Note: While this Part focuses on functional patterns, warp-level programming and advanced GPU memory management will be covered in detail in Part VI.

What Mojo abstracts for you:

  • Grid/Block configuration: Automatically calculated based on problem size (see the sketch after this list)
  • Warp management: Hardware handles 32-thread groups transparently
  • Thread scheduling: GPU scheduler manages execution automatically
  • Memory hierarchy: Optimal access patterns built into functional operations
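
To make the first item above concrete, here is a rough host-side sketch of the kind of sizing arithmetic involved. The block size and the choice of one thread per SIMD group are illustrative assumptions for this section's 1024-element example, not what Mojo's functional operations literally compute.

# Hypothetical sizing of the launch hierarchy for this section's example.
# Mojo's functional layer derives equivalent numbers for you; the block size
# here (THREADS_PER_BLOCK) is an assumption for illustration only.
alias SIZE = 1024
alias SIMD_WIDTH = 4
alias THREADS_PER_BLOCK = 64   # assumed block size, not what Mojo picks
alias WARP_SIZE = 32           # lockstep group size on current GPUs


fn main():
    # Elementwise pattern: one thread per SIMD_WIDTH-element group.
    var total_threads = SIZE // SIMD_WIDTH                                     # 256
    # Ceiling division: enough blocks to cover every thread.
    var blocks = (total_threads + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK  # 4
    # Each block executes as warps of WARP_SIZE threads in lockstep.
    var warps_per_block = (THREADS_PER_BLOCK + WARP_SIZE - 1) // WARP_SIZE     # 2
    print("threads:", total_threads, "blocks:", blocks, "warps/block:", warps_per_block)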

SIMD within GPU threads

Each GPU thread can process multiple data elements simultaneously using SIMD (Single Instruction, Multiple Data) operations:

# Within one GPU thread:
a_simd = a.load[simd_width](idx, 0)    # Load 4 floats simultaneously
b_simd = b.load[simd_width](idx, 0)    # Load 4 floats simultaneously
result = a_simd + b_simd               # Add 4 pairs simultaneously
out.store[simd_width](idx, 0, result)  # Store 4 results simultaneously
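
For context, that body usually sits inside the nested function passed to the functional elementwise operation from the earlier sections. The sketch below is a rough reconstruction of that shape; treat the imports, parameter conventions, and the exact elementwise/LayoutTensor signatures as assumptions to check against the current standard library rather than a verified listing.

# Sketch: where the per-thread SIMD body sits in the functional pattern.
# Signatures are assumptions based on earlier sections; verify before use.
from algorithm.functional import elementwise
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
from utils.index import IndexList


fn vector_add[
    layout: Layout, dtype: DType, simd_width: Int
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
    ctx: DeviceContext,
    size: Int,
) raises:
    @parameter
    @always_inline
    fn add[width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
        var idx = indices[0]
        # The SIMD body from above: one thread handles `width` elements.
        var a_simd = a.load[width](idx, 0)
        var b_simd = b.load[width](idx, 0)
        output.store[width](idx, 0, a_simd + b_simd)

    # One logical thread per group of `simd_width` elements.
    elementwise[add, simd_width, target="gpu"](size, ctx)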

Pattern comparison and thread-to-work mapping

Critical insight: All patterns perform the same total work - 256 SIMD operations for 1024 elements with SIMD_WIDTH=4. The difference is in how this work is distributed across GPU threads.

Thread organization comparison (SIZE=1024, SIMD_WIDTH=4)

Pattern           | Threads | SIMD ops/thread | Memory pattern     | Trade-off
------------------|---------|-----------------|--------------------|--------------------------------
Elementwise       | 256     | 1               | Distributed access | Max parallelism, poor locality
Tiled             | 32      | 8               | Small blocks       | Balanced parallelism + locality
Manual vectorized | 8       | 32              | Large chunks       | High bandwidth, fewer threads
Mojo vectorize    | 32      | 8               | Smart blocks       | Automatic optimization
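
A quick way to sanity-check the table is to multiply threads by SIMD operations per thread: every row covers the same 1024 elements in 256 SIMD operations. A small sketch of that arithmetic (the per-pattern chunk sizes are taken from the table above):

# Sanity check: threads x SIMD ops per thread is constant across patterns.
alias SIZE = 1024
alias SIMD_WIDTH = 4


fn main():
    # Elements handled by one thread: elementwise, tiled, manual, vectorize.
    var chunk_sizes = List[Int](4, 32, 128, 32)

    for i in range(len(chunk_sizes)):
        var chunk = chunk_sizes[i]
        var threads = SIZE // chunk                  # threads the pattern launches
        var simd_ops_per_thread = chunk // SIMD_WIDTH
        # Always 256 total SIMD operations, just distributed differently.
        print(threads, "threads x", simd_ops_per_thread, "SIMD ops =", threads * simd_ops_per_thread)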

Detailed execution patterns

Elementwise pattern:

Thread 0: [0,1,2,3] → Thread 1: [4,5,6,7] → ... → Thread 255: [1020,1021,1022,1023]
256 threads × 1 SIMD op = 256 total SIMD operations

Tiled pattern:

Thread 0: [0:32] (8 SIMD) → Thread 1: [32:64] (8 SIMD) → ... → Thread 31: [992:1024] (8 SIMD)
32 threads × 8 SIMD ops = 256 total SIMD operations

Manual vectorized pattern:

Thread 0: [0:128] (32 SIMD) → Thread 1: [128:256] (32 SIMD) → ... → Thread 7: [896:1024] (32 SIMD)
8 threads × 32 SIMD ops = 256 total SIMD operations

Mojo vectorize pattern:

Thread 0: [0:32] auto-vectorized → Thread 1: [32:64] auto-vectorized → ... → Thread 31: [992:1024] auto-vectorized
32 threads × 8 SIMD ops = 256 total SIMD operations
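
All four mappings follow the same rule: thread i owns the half-open range [i * chunk, (i + 1) * chunk), where chunk is the number of elements that pattern assigns to each thread. A minimal sketch of that index arithmetic, using chunk sizes from the diagrams above:

# Index arithmetic behind the diagrams above: which elements a thread owns.
fn thread_range(tid: Int, chunk: Int) -> Tuple[Int, Int]:
    # Half-open range [start, end) for a thread that covers `chunk` elements.
    return (tid * chunk, (tid + 1) * chunk)


fn main():
    # Tiled pattern: 32 elements per thread.
    var tiled = thread_range(1, 32)     # thread 1 -> [32, 64)
    # Manual vectorization: 128 elements per thread.
    var manual = thread_range(7, 128)   # thread 7 -> [896, 1024)
    print("tiled thread 1:", tiled[0], "to", tiled[1])
    print("manual thread 7:", manual[0], "to", manual[1])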

Performance characteristics and trade-offs

Core trade-offs summary

Aspect           | High thread count (Elementwise) | Moderate threads (Tiled/Vectorize) | Low threads (Manual)
-----------------|---------------------------------|------------------------------------|----------------------
Parallelism      | Maximum latency hiding          | Balanced approach                  | Minimal parallelism
Cache locality   | Poor between threads            | Good within tiles                  | Excellent sequential
Memory bandwidth | Good coalescing                 | Good + cache reuse                 | Maximum theoretical
Complexity       | Simplest                        | Moderate                           | Most complex

When to choose each pattern

Use elementwise when:

  • Simple operations with minimal arithmetic per element
  • Maximum parallelism needed for latency hiding
  • Scalability across different problem sizes is important

Use tiled/vectorize when:

  • Cache-sensitive operations that benefit from data reuse
  • Balanced performance and maintainability desired
  • Automatic optimization (vectorize) is preferred

Use manual vectorization when:

  • Expert-level control over memory patterns is needed
  • Maximum memory bandwidth utilization is critical
  • Development complexity is acceptable

Hardware considerations

Modern GPU architectures include several levels that Mojo abstracts:

Hardware reality:

  • Warps: 32 threads execute in lockstep
  • Streaming Multiprocessors (SMs): Multiple warps execute concurrently
  • SIMD units: Vector processing units within each SM
  • Memory hierarchy: L1/L2 caches, shared memory, global memory

Mojo’s abstraction benefits:

  • Automatically handles warp alignment and scheduling
  • Optimizes memory access patterns transparently
  • Manages resource allocation across SMs
  • Provides portable performance across GPU vendors

Performance mental model

Think of GPU programming as managing two complementary types of parallelism:

Thread-level parallelism:

  • Provides the parallel structure (how many execution units)
  • Enables latency hiding through concurrent execution
  • Managed by GPU scheduler automatically

SIMD-level parallelism:

  • Provides vectorization within each thread
  • Maximizes arithmetic throughput per thread
  • Utilizes vector processing units efficiently

Optimal performance formula:

Performance = (Sufficient threads for latency hiding) ×
              (Efficient SIMD utilization) ×
              (Optimal memory access patterns)

Scaling considerations

Problem size   | Optimal pattern     | Reasoning
---------------|---------------------|------------------------
Small (< 1K)   | Tiled/Vectorize     | Lower launch overhead
Medium (1K-1M) | Any pattern         | Similar performance
Large (> 1M)   | Usually Elementwise | Parallelism dominates

The optimal choice depends on your specific hardware, workload complexity, and development constraints.

Next steps

With a solid understanding of how GPU threads and SIMD operations fit together, keep the following in mind as you continue:

💡 Key takeaway: GPU threads and SIMD operations work together as complementary levels of parallelism. Understanding their relationship allows you to choose the right pattern for your specific performance requirements and constraints.