⚡ Tile - Memory-Efficient Tiled Processing
Overview
Building on the elementwise pattern, this puzzle introduces tiled processing - a fundamental technique for optimizing memory access patterns and cache utilization on GPUs. Instead of each thread processing individual SIMD vectors across the entire array, tiling organizes data into smaller, manageable chunks that fit better in cache memory.
You’ve already seen tiling in action with Puzzle 14’s tiled matrix multiplication, where we used tiles to process large matrices efficiently. Here, we apply the same tiling principles to vector operations, demonstrating how this technique scales from 2D matrices to 1D arrays.
Implement the same vector addition operation using Mojo’s tiled approach. Each GPU thread will process an entire tile of data sequentially, demonstrating how memory locality can improve performance for certain workloads.
Key insight: Tiling trades parallel breadth for memory locality - fewer threads each doing more work with better cache utilization.
Key concepts
In this puzzle, you’ll master:
- Tile-based memory organization for cache optimization
- Sequential SIMD processing within tiles
- Memory locality principles and cache-friendly access patterns
- Thread-to-tile mapping vs thread-to-element mapping
- Performance trade-offs between parallelism and memory efficiency
The same mathematical operation as elementwise: \[\Large \text{out}[i] = a[i] + b[i]\]
But with a completely different execution strategy optimized for memory hierarchy.
Configuration
- Vector size: SIZE = 1024
- Tile size: TILE_SIZE = 32
- Data type: DType.float32
- SIMD width: GPU-dependent (for operations within tiles)
- Layout: Layout.row_major(SIZE) (1D row-major)
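For reference, this configuration typically appears near the top of the puzzle file as compile-time aliases. The sketch below is illustrative rather than a copy of the file, so check problems/p20/p20.mojo for the exact names; the SIMD width is queried from the target GPU at compile time rather than written as a literal.
from layout import Layout

alias SIZE = 1024
alias TILE_SIZE = 32
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)  # 1D row-major layout over SIZE elements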
Code to complete
alias TILE_SIZE = 32


fn tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn process_tiles[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        print("tile_id:", tile_id)
        out_tile = out.tile[tile_size](tile_id)
        a_tile = a.tile[tile_size](tile_id)
        b_tile = b.tile[tile_size](tile_id)
        # FILL IN (6 lines at most)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)
View full file: problems/p20/p20.mojo
Tips
1. Understanding tile organization
The tiled approach divides your data into fixed-size chunks:
num_tiles = (size + tile_size - 1) // tile_size # Ceiling division
For a 1024-element vector with TILE_SIZE = 32: 1024 ÷ 32 = 32 tiles exactly.
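To convince yourself of the ceiling division, here is a minimal standalone Mojo sketch (host-only, using this puzzle's constants) that you can run on the CPU:
fn main():
    alias SIZE = 1024
    alias TILE_SIZE = 32
    # Ceiling division rounds up, so a partial final tile would still be counted
    print((SIZE + TILE_SIZE - 1) // TILE_SIZE)  # 32 tiles exactly
    print((1000 + TILE_SIZE - 1) // TILE_SIZE)  # 32 tiles: 31 full tiles + 1 partial tile of 8 elements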
2. Tile extraction pattern
Check out the LayoutTensor .tile documentation.
tile_id = indices[0] # Each thread gets one tile to process
out_tile = out.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)
The tile[size](id) method creates a view of size consecutive elements starting at id × size.
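As a sanity check on that indexing rule, here is a plain-Mojo sketch of the arithmetic only (it does not call the LayoutTensor API): element i of tile tile_id aliases global element tile_id × size + i.
fn main():
    alias TILE_SIZE = 32
    var tile_id = 2
    for i in range(3):
        # Offset i inside tile 2 maps to global indices 64, 65, 66, ...
        print("tile", tile_id, "offset", i, "-> global index", tile_id * TILE_SIZE + i)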
3. Sequential processing within tiles
Unlike elementwise, you process the tile sequentially:
@parameter
for i in range(tile_size):
    # Process element i within the current tile
This @parameter loop unrolls at compile-time for optimal performance.
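If @parameter loops are new to you, this tiny standalone sketch shows the form; the trip count is a compile-time constant, so the compiler can stamp out every iteration instead of branching at runtime:
fn main():
    alias TILE_SIZE = 4  # small value purely for illustration

    @parameter
    for i in range(TILE_SIZE):
        print("unrolled step", i)  # each iteration is generated at compile time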
4. SIMD operations within tile elements
a_vec = a_tile.load[simd_width](i, 0) # Load from position i in tile
b_vec = b_tile.load[simd_width](i, 0) # Load from position i in tile
result = a_vec + b_vec # SIMD addition (GPU-dependent width)
out_tile.store[simd_width](i, 0, result) # Store to position i in tile
5. Thread configuration difference
elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)
Note the 1 instead of SIMD_WIDTH - each thread processes one entire tile sequentially.
6. Memory access pattern insight
Each thread accesses a contiguous block of memory (the tile), then moves to the next tile. This creates excellent spatial locality within each thread’s execution.
7. Key debugging insight
With tiling, you’ll see fewer thread launches but each does more work:
- Elementwise: ~256 threads (for SIMD_WIDTH=4), each processing 4 elements
- Tiled: ~32 threads, each processing 32 elements sequentially
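A quick back-of-the-envelope check of those counts, assuming SIMD_WIDTH = 4 as in the sample output below (the real width is GPU-dependent):
fn main():
    alias SIZE = 1024
    alias SIMD_WIDTH = 4  # assumption for this estimate
    alias TILE_SIZE = 32
    print("elementwise threads:", SIZE // SIMD_WIDTH)  # 256, each covering 4 elements
    print("tiled threads:", SIZE // TILE_SIZE)         # 32, each covering 32 elements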
Running the code
To test your solution, run one of the following commands in your terminal:
uv run poe p20 --tiled
pixi run p20 --tiled
Your output will look like this when not yet solved:
SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
...
tile_id: 29
tile_id: 30
tile_id: 31
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])
Solution
alias TILE_SIZE = 32


fn tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn process_tiles[
        simd_width: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        output_tile = output.tile[tile_size](tile_id)
        a_tile = a.tile[tile_size](tile_id)
        b_tile = b.tile[tile_size](tile_id)

        @parameter
        for i in range(tile_size):
            a_vec = a_tile.load[simd_width](i, 0)
            b_vec = b_tile.load[simd_width](i, 0)
            ret = a_vec + b_vec
            output_tile.store[simd_width](i, 0, ret)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)
The tiled processing pattern demonstrates advanced memory optimization techniques for GPU programming:
1. Tiling philosophy and memory hierarchy
Tiling represents a fundamental shift in how we think about parallel processing:
Elementwise approach:
- Wide parallelism: Many threads, each doing minimal work
- Global memory pressure: Threads scattered across entire array
- Cache misses: Poor spatial locality across thread boundaries
Tiled approach:
- Deep parallelism: Fewer threads, each doing substantial work
- Localized memory access: Each thread works on contiguous data
- Cache optimization: Excellent spatial and temporal locality
2. Tile organization and indexing
tile_id = indices[0]
out_tile = out.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)
Tile mapping visualization (TILE_SIZE=32):
Original array: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ..., 1023]
Tile 0 (thread 0): [0, 1, 2, ..., 31] ← Elements 0-31
Tile 1 (thread 1): [32, 33, 34, ..., 63] ← Elements 32-63
Tile 2 (thread 2): [64, 65, 66, ..., 95] ← Elements 64-95
...
Tile 31 (thread 31): [992, 993, ..., 1023] ← Elements 992-1023
Key insights:
- Each tile[size](id) creates a view into the original tensor
- Views are zero-copy - no data movement, just pointer arithmetic
- Tile boundaries are always aligned to tile_size boundaries
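To see the tile-to-range mapping concretely, here is a small host-only sketch that just prints the half-open global ranges each tile view covers (it mirrors the visualization above rather than calling the tile API):
fn main():
    alias SIZE = 1024
    alias TILE_SIZE = 32
    for tile_id in range(SIZE // TILE_SIZE):
        # Tile t covers global elements [t * TILE_SIZE, (t + 1) * TILE_SIZE)
        print("tile", tile_id, "-> [", tile_id * TILE_SIZE, ":", (tile_id + 1) * TILE_SIZE, ")")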
3. Sequential processing deep dive
@parameter
for i in range(tile_size):
    a_vec = a_tile.load[simd_width](i, 0)
    b_vec = b_tile.load[simd_width](i, 0)
    ret = a_vec + b_vec
    out_tile.store[simd_width](i, 0, ret)
Why sequential processing?
- Cache optimization: Consecutive memory accesses maximize cache hit rates
- Compiler optimization: @parameter loops unroll completely at compile-time
- Memory bandwidth: Sequential access aligns with memory controller design
- Reduced coordination: No need to synchronize between SIMD groups
Execution pattern within one tile (TILE_SIZE=32, SIMD_WIDTH=4):
Thread processes tile sequentially:
Step 0: Process elements [0:4] with SIMD
Step 1: Process elements [4:8] with SIMD
Step 2: Process elements [8:12] with SIMD
...
Step 7: Process elements [28:32] with SIMD
Total: 8 SIMD operations per thread (32 ÷ 4 = 8)
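The step breakdown above can be reproduced with a trivial host-side loop; SIMD_WIDTH = 4 is an assumption here, matching the sample output:
fn main():
    alias TILE_SIZE = 32
    alias SIMD_WIDTH = 4  # assumed for illustration; the real width is GPU-dependent
    for step in range(TILE_SIZE // SIMD_WIDTH):
        # Each step covers one SIMD_WIDTH-wide slice of the tile
        print("step", step, "-> elements [", step * SIMD_WIDTH, ":", (step + 1) * SIMD_WIDTH, ")")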
4. Memory access pattern analysis
Cache behavior comparison:
Elementwise pattern:
Thread 0: accesses global positions [0, 4, 8, 12, ...] ← Stride = SIMD_WIDTH
Thread 1: accesses global positions [4, 8, 12, 16, ...] ← Stride = SIMD_WIDTH
...
Result: Memory accesses spread across entire array
Tiled pattern:
Thread 0: accesses positions [0:32] sequentially ← Contiguous 32-element block
Thread 1: accesses positions [32:64] sequentially ← Next contiguous 32-element block
...
Result: Perfect spatial locality within each thread
Cache efficiency implications:
- L1 cache: Small tiles often fit better in L1 cache, reducing cache misses
- Memory bandwidth: Sequential access maximizes effective bandwidth
- TLB efficiency: Fewer translation lookaside buffer misses
- Prefetching: Hardware prefetchers work optimally with sequential patterns
5. Thread configuration strategy
elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)
Why 1 instead of SIMD_WIDTH?
- Thread count: Launch exactly num_tiles threads, not num_tiles × SIMD_WIDTH
- Work distribution: Each thread handles one complete tile
- Load balancing: More work per thread, fewer threads total
- Memory locality: Each thread’s work is spatially localized
Performance trade-offs:
- Fewer logical threads: May not fully utilize all GPU cores at low occupancy
- More work per thread: Better cache utilization and reduced coordination overhead
- Sequential access: Optimal memory bandwidth utilization within each thread
- Reduced overhead: Less thread launch and coordination overhead
Important note: “Fewer threads” refers to the logical programming model. The GPU scheduler can still achieve high hardware utilization by running multiple warps and efficiently switching between them during memory stalls.
6. Performance characteristics
When tiling helps:
- Memory-bound operations: When memory bandwidth is the bottleneck
- Cache-sensitive workloads: Operations that benefit from data reuse
- Complex operations: When compute per element is higher
- Limited parallelism: When you have fewer threads than GPU cores
When tiling hurts:
- Highly parallel workloads: When you need maximum thread utilization
- Simple operations: When memory access dominates over computation
- Irregular access patterns: When tiling doesn’t improve locality
For our simple addition example (TILE_SIZE=32):
- Thread count: 32 threads instead of 256 (8× fewer)
- Work per thread: 32 elements instead of 4 (8× more)
- Memory pattern: Sequential vs strided access
- Cache utilization: Much better spatial locality
7. Advanced tiling considerations
Tile size selection:
- Too small: Poor cache utilization, more overhead
- Too large: May not fit in cache, reduced parallelism
- Sweet spot: Usually 16-64 elements for L1 cache optimization
- Our choice: 32 elements balances cache usage with parallelism
Hardware considerations:
- Cache size: Tiles should fit in L1 cache when possible
- Memory bandwidth: Consider memory controller width
- Core count: Ensure enough tiles to utilize all cores
- SIMD width: Tile size should be a multiple of the SIMD width (sanity-checked in the sketch below)
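The last two constraints are easy to check programmatically. Here is a small hedged sketch; the minimum tile count is an illustrative threshold, not a hardware constant:
fn check_tile_size(tile_size: Int, simd_width: Int, size: Int) -> Bool:
    # The tile size should be a multiple of the SIMD width so every load/store is full
    if tile_size % simd_width != 0:
        return False
    # There should be enough tiles to keep the GPU's cores busy
    var num_tiles = (size + tile_size - 1) // tile_size
    return num_tiles >= 32  # illustrative lower bound only

fn main():
    print(check_tile_size(32, 4, 1024))  # True
    print(check_tile_size(30, 4, 1024))  # False: 30 is not a multiple of 4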
Comparison summary:
Elementwise: High parallelism, scattered memory access
Tiled: Moderate parallelism, localized memory access
The choice between elementwise and tiled patterns depends on your specific workload characteristics, data access patterns, and target hardware capabilities.
Next steps
Now that you understand both elementwise and tiled patterns:
- 🔧 Vectorization: Fine-grained control over SIMD operations
- 🧠 GPU Threading vs SIMD: Understanding the execution hierarchy
- 📊 Benchmarking: Performance analysis and optimization
💡 Key takeaway: Tiling demonstrates how memory access patterns often matter more than raw computational throughput. The best GPU code balances parallelism with memory hierarchy optimization.