Performance: Coalesced vs non-coalesced memory access
Understanding memory access patterns is crucial for GPU performance optimization. This section explains why coalesced memory access patterns typically outperform non-coalesced patterns, particularly for memory-bound operations like embedding lookups.
Memory coalescing basics
Memory coalescing occurs when consecutive threads in a warp access consecutive memory addresses. GPUs can combine these individual memory requests into fewer, larger memory transactions, dramatically improving bandwidth utilization.
Coalesced vs non-coalesced access
Coalesced (efficient):
- Thread 0 → Address 0x1000
- Thread 1 → Address 0x1004
- Thread 2 → Address 0x1008
- Thread 3 → Address 0x100C
- ...
Result: 1 memory transaction for entire warp (32 threads)
Non-coalesced (inefficient):
- Thread 0 → Address 0x1000
- Thread 1 → Address 0x2000
- Thread 2 → Address 0x3000
- Thread 3 → Address 0x4000
- ...
Result: Up to 32 separate memory transactions
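To make the contrast concrete, here is a minimal, hypothetical CUDA sketch (the kernel names and the stride parameter are illustrative, not from any particular codebase). In the first kernel, the 32 threads of a warp read 32 adjacent floats, which the hardware can service with very few transactions; in the second, adjacent threads read addresses that are `stride` floats apart, so the warp's requests land in many different cache lines.

```cuda
// Coalesced: thread i reads element i, so adjacent threads in a warp
// touch adjacent 4-byte addresses.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: adjacent threads read addresses `stride` floats apart,
// so a 32-thread warp can touch up to 32 different cache lines.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(long long)i * stride % n];  // modulo keeps the read in bounds
}
```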
Why embedding operations are memory-bound
Embedding lookups are memory-bound because they involve:
- Minimal computation: Just copying data from input to output
- Large memory footprint: Embedding tables can be gigabytes in size
- High memory bandwidth requirements: Need to transfer large amounts of data
For such operations, memory access efficiency determines performance more than computational complexity.
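As a rough, illustrative example: for a hypothetical batch of 32 sequences of length 512 with a 1024-dimensional float32 embedding, one lookup reads and writes about 2 × 32 × 512 × 1024 × 4 B ≈ 128 MiB; even at roughly 1 TB/s of device memory bandwidth, that traffic alone takes on the order of 0.13 ms, while the arithmetic is a trivial copy.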
Kernel comparison
1D coalesced kernel
- Thread organization: [total_elements // 256] blocks, one thread per output element
- Memory pattern: Consecutive threads access consecutive embedding dimensions
- Why it's coalesced: Thread 0 writes output[0,0,0], Thread 1 writes output[0,0,1], and so on → consecutive addresses (see the kernel sketch below)
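A minimal CUDA sketch of such a kernel, assuming a float32 embedding table of shape [vocab_size, embed_dim] and int32 token indices (the kernel and parameter names are illustrative, not taken from a specific codebase):

```cuda
// Hypothetical 1D coalesced embedding lookup: one thread per output scalar.
// Launched as embedding_lookup_1d<<<(num_tokens * embed_dim + 255) / 256, 256>>>(...).
__global__ void embedding_lookup_1d(const float* __restrict__ table,   // [vocab_size, embed_dim]
                                    const int*   __restrict__ indices, // [batch * seq]
                                    float*       __restrict__ output,  // [batch * seq, embed_dim]
                                    int num_tokens,
                                    int embed_dim)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global output element id
    if (idx >= num_tokens * embed_dim) return;

    int token = idx / embed_dim;   // which (batch, seq) position
    int dim   = idx % embed_dim;   // which embedding dimension

    int row = indices[token];                              // row in the embedding table
    output[idx] = table[(size_t)row * embed_dim + dim];    // adjacent threads -> adjacent addresses
}
```

Because adjacent threads in a warp differ only in `dim`, both the load from the table row and the store to the output fall on adjacent 4-byte addresses.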
2D non-coalesced kernel
- Thread organization: [batch*seq // 16, embed_dim // 16] blocks with 16×16 threads
- Memory pattern: Threads in the same warp may access different embedding vectors
- Why it's non-coalesced: Thread access pattern can be scattered across memory (see the kernel sketch below)
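A corresponding sketch of the 2D variant, under the same assumptions. Within a warp, `threadIdx.x` varies fastest, so consecutive threads land on different tokens at the same embedding dimension, and their addresses are `embed_dim` floats apart:

```cuda
// Hypothetical 2D embedding lookup with blockDim = (16, 16):
// x indexes the token (batch*seq position), y indexes the embedding dimension.
// Adjacent threads in a warp differ in `token`, so their addresses are
// embed_dim * 4 bytes apart and do not coalesce.
__global__ void embedding_lookup_2d(const float* __restrict__ table,
                                    const int*   __restrict__ indices,
                                    float*       __restrict__ output,
                                    int num_tokens,
                                    int embed_dim)
{
    int token = blockIdx.x * blockDim.x + threadIdx.x;  // batch*seq position
    int dim   = blockIdx.y * blockDim.y + threadIdx.y;  // embedding dimension
    if (token >= num_tokens || dim >= embed_dim) return;

    int row = indices[token];
    output[token * embed_dim + dim] = table[(size_t)row * embed_dim + dim];
}
```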
Performance results
Typical benchmark results:
Performance Results:
1D Coalesced: 2.145 ms
2D Non-coalesced: 3.867 ms
1D is 1.80x faster than 2D
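The exact numbers depend on the GPU, the tensor shapes, and the benchmark harness, so treat them as illustrative of the relative gap rather than as absolute figures. A minimal sketch of how such a timing might be collected with CUDA events, assuming the hypothetical embedding_lookup_1d kernel sketched earlier (the sizes are made up; a real benchmark would warm up, use realistic indices, and average over many launches):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int vocab_size = 50000, num_tokens = 32 * 512, embed_dim = 1024;
    const size_t table_bytes  = (size_t)vocab_size * embed_dim * sizeof(float);
    const size_t output_bytes = (size_t)num_tokens * embed_dim * sizeof(float);

    float *d_table, *d_output;
    int   *d_indices;
    cudaMalloc(&d_table, table_bytes);
    cudaMalloc(&d_output, output_bytes);
    cudaMalloc(&d_indices, num_tokens * sizeof(int));
    cudaMemset(d_indices, 0, num_tokens * sizeof(int));  // all tokens -> row 0; random indices would be more realistic

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int total = num_tokens * embed_dim;
    cudaEventRecord(start);
    embedding_lookup_1d<<<(total + 255) / 256, 256>>>(d_table, d_indices, d_output,
                                                      num_tokens, embed_dim);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("1D coalesced: %.3f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_table); cudaFree(d_output); cudaFree(d_indices);
    return 0;
}
```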
Memory access visualization
Coalesced pattern (1D kernel)
Warp execution for output[0,0,0:32]:
| Element | Thread ID | Memory Access | Address Pattern |
|---|---|---|---|
| output[0,0,0] | 0 | [0,0] | Base + 0 |
| output[0,0,1] | 1 | [0,1] | Base + 4 |
| output[0,0,2] | 2 | [0,2] | Base + 8 |
| output[0,0,3] | 3 | [0,3] | Base + 12 |
| … | … | … | … |
| output[0,0,31] | 31 | [0,31] | Base + 124 |

Result: Consecutive addresses → 1 memory transaction for the entire warp
Non-coalesced pattern (2D kernel)
Warp execution with 16×16 blocks:
Block organization (16×16):
X-dim: batch*seq positions (0-15)
Y-dim: embed dimensions (0-15)
Warp threads might access:
Thread 0: batch=0, seq=0, embed=0 → Address A
Thread 1: batch=0, seq=1, embed=0 → Address B (different row)
Thread 2: batch=0, seq=2, embed=0 → Address C (different row)
...
Thread 31: batch=1, seq=15, embed=0 → Address Z (scattered)
Result: Potentially scattered addresses → Multiple memory transactions
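If a 2D launch is otherwise convenient, one common remedy (sketched here under the same assumptions as before, with an illustrative kernel name) is to map the fastest-varying thread index, `threadIdx.x`, to the embedding dimension so that adjacent threads in a warp touch adjacent floats of the same table row:

```cuda
// Hypothetical coalesced 2D variant: x now indexes the embedding dimension,
// so adjacent threads in a warp read adjacent 4-byte addresses of one row.
__global__ void embedding_lookup_2d_coalesced(const float* __restrict__ table,
                                              const int*   __restrict__ indices,
                                              float*       __restrict__ output,
                                              int num_tokens,
                                              int embed_dim)
{
    int dim   = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index -> contiguous dimension
    int token = blockIdx.y * blockDim.y + threadIdx.y;
    if (token >= num_tokens || dim >= embed_dim) return;

    int row = indices[token];
    output[token * embed_dim + dim] = table[(size_t)row * embed_dim + dim];
}
```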
Key optimization strategies
- Prefer 1D indexing for memory-bound operations when possible
- Align data structures to coalescing-friendly layouts
- Consider memory access patterns during kernel design
- Profile memory bandwidth to identify bottlenecks (a rough estimate sketch follows this list)
- Use memory-bound benchmarks to validate optimizations
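As one rough way to act on the profiling advice above, the hypothetical timing harness sketched earlier could be extended to estimate effective bandwidth and compare it against the device's theoretical peak. Note that the cudaDeviceProp fields used here are deprecated on newer CUDA releases, so a profiler or the device datasheet is the more reliable source for peak bandwidth:

```cuda
// Extends the hypothetical timing harness above: ms, num_tokens, and embed_dim
// are assumed to be in scope.
double bytes_moved = 2.0 * num_tokens * embed_dim * sizeof(float);  // ~ read one table row per token + write output
double gb_per_s    = bytes_moved / (ms * 1e-3) / 1e9;

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// Peak DRAM bandwidth estimate: 2 (DDR) * memory clock (kHz -> Hz) * bus width in bytes.
double peak_gb_per_s = 2.0 * prop.memoryClockRate * 1e3 * (prop.memoryBusWidth / 8.0) / 1e9;

printf("Effective bandwidth: %.1f GB/s (~%.0f%% of peak)\n",
       gb_per_s, peak_gb_per_s > 0 ? 100.0 * gb_per_s / peak_gb_per_s : 0.0);
```

A kernel that reaches a large fraction of peak bandwidth is already close to the memory-bound ceiling; one that falls far short usually has a coalescing or access-pattern problem worth investigating.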
The core insight: memory access patterns often determine GPU performance more than computational complexity, especially for memory-bound operations like embeddings.