Puzzle 19: Embedding Op
Memory access patterns and performance
We’re continuing Part IV with a focus on memory-bound operations and GPU memory access optimization.
Building on Puzzle 18, you’ll now explore how different kernel implementations of the same operation can have dramatically different performance characteristics. You’ll learn:
- How GPU memory coalescing affects performance
- Why grid configuration matters for memory-bound operations
- How to design kernels with optimal memory access patterns
- The performance implications of different threading strategies
This puzzle demonstrates that how you access memory can be more important than what computation you perform.
Overview
In this puzzle, you'll implement two different GPU kernels for the embedding operation, a fundamental building block of neural networks. While both kernels produce identical results, they use different memory access patterns that lead to significant performance differences.
You’ll compare:
- 1D coalesced kernel: Optimized for memory bandwidth with consecutive memory accesses
- 2D non-coalesced kernel: Suboptimal memory access pattern for comparison
This comparison teaches the critical importance of memory coalescing in GPU kernel performance.
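The puzzle's kernels themselves are written in Mojo, but the difference between the two layouts can be previewed in plain Python. The sketch below uses made-up shapes and an assumed 2D mapping (x over token positions, y over embedding elements), which may not match the puzzle's exact grid; it simply prints which flat output offsets a run of consecutive threads would touch under each scheme:

```python
# Illustration only: which flat offsets do consecutive threads touch?
# Shapes and the 2D mapping are assumptions for this sketch, not the
# puzzle's actual configuration.
batch_size, seq_len, embed_dim = 2, 3, 4
num_positions = batch_size * seq_len          # flattened (batch, seq) positions

def flat_offset(position, dim):
    """Row-major offset into a [batch*seq, embed_dim] output buffer."""
    return position * embed_dim + dim

# 1D scheme: thread t handles output element t, so neighboring threads hit
# consecutive addresses -> a few wide, coalesced memory transactions.
one_d_offsets = [t for t in range(num_positions)]

# Assumed 2D scheme: consecutive threads in x handle different token
# positions but the same embedding element, so their addresses are strided
# by embed_dim -> many partially used transactions.
two_d_offsets = [flat_offset(position=tx, dim=0) for tx in range(num_positions)]

print("1D offsets:", one_d_offsets)   # [0, 1, 2, 3, 4, 5]      (unit stride)
print("2D offsets:", two_d_offsets)   # [0, 4, 8, 12, 16, 20]   (stride = embed_dim)
```

On real hardware the contiguous run in the first case lets a warp's loads and stores be combined into a handful of wide transactions, while the strided run forces the memory system to fetch far more bytes than the threads actually use.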
Background: Embedding operations
An embedding operation converts discrete token indices into dense vector representations:
# Input: token indices
indices = [[1, 5, 2], [7, 1, 9]]  # Shape: [batch_size, seq_len]

# Embedding table (learned parameters)
embedding_table = [  # Shape: [vocab_size, embed_dim]
    [0.1, 0.2, 0.3, 0.4],  # Token 0
    [0.5, 0.6, 0.7, 0.8],  # Token 1
    [0.9, 1.0, 1.1, 1.2],  # Token 2
    # ... more tokens
]

# Output: embedded vectors
output[0,0] = embedding_table[1]  # [0.5, 0.6, 0.7, 0.8]
output[0,1] = embedding_table[5]  # lookup token 5's embedding
output[0,2] = embedding_table[2]  # [0.9, 1.0, 1.1, 1.2]
# ... and so on
This operation is memory-bound: performance depends on how efficiently you can read from the embedding table and write to the output tensor, not on how fast you can compute.
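To make the operation concrete, here is a minimal NumPy reference (shapes are made up for the example; the puzzle supplies its own tensors) together with a rough count of the bytes it moves:

```python
import numpy as np

batch_size, seq_len = 2, 3
vocab_size, embed_dim = 10, 4

indices = np.array([[1, 5, 2], [7, 1, 9]])            # [batch_size, seq_len]
embedding_table = np.arange(vocab_size * embed_dim,
                            dtype=np.float32).reshape(vocab_size, embed_dim)

# The embedding op is just a gather: each (batch, seq) position copies one row.
output = embedding_table[indices]                      # [batch_size, seq_len, embed_dim]
assert output.shape == (batch_size, seq_len, embed_dim)

# Per output row: read embed_dim floats from the table, write embed_dim floats
# to the output, and do zero arithmetic. That ratio is what "memory-bound" means.
bytes_moved = 2 * batch_size * seq_len * embed_dim * 4   # float32 reads + writes
print(f"{bytes_moved} bytes moved, 0 FLOPs")
```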
Learning path
This puzzle is structured in two parts to build your understanding systematically:
Simple embedding kernel
Start here to implement the actual puzzle code and understand the kernel implementations.
What you’ll do:
- Complete two different GPU embedding kernels (1D coalesced vs 2D non-coalesced)
- Learn fundamental memory access patterns for GPU programming
- See the same algorithm implemented with different threading strategies
- Understand custom operation registration in Mojo
Performance comparison
Deep dive into why the kernels perform differently and the theory behind memory coalescing; a rough back-of-the-envelope estimate of the memory-bandwidth limit follows the list below.
What you’ll learn:
- Why memory coalescing matters for GPU performance
- How thread organization affects memory bandwidth utilization
- Real-world implications for neural network optimization
- Optimization strategies for memory-bound operations
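As a preview of that analysis, a memory-bound kernel's best-case runtime is simply bytes moved divided by memory bandwidth. The numbers below (tensor sizes and a ~1 TB/s bandwidth figure) are illustrative assumptions, not measurements from the puzzle:

```python
# Back-of-the-envelope lower bound for a memory-bound embedding lookup.
# All figures here are assumptions for illustration, not puzzle measurements.
batch_size, seq_len, embed_dim = 32, 1024, 1024
dtype_bytes = 4                              # float32
assumed_bandwidth_bps = 1e12                 # ~1 TB/s, hypothetical GPU

# Each output element is read once from the table and written once;
# the index reads are negligible by comparison.
bytes_moved = 2 * batch_size * seq_len * embed_dim * dtype_bytes
min_time_us = bytes_moved / assumed_bandwidth_bps * 1e6

print(f"{bytes_moved / 1e6:.0f} MB moved -> at least {min_time_us:.0f} us")
# A coalesced kernel can approach this bound; a non-coalesced one falls short
# because each warp's scattered accesses waste most of every transaction.
```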
Getting started
Ready to explore GPU memory optimization? Start with the Simple embedding kernel to implement the code, then move to Performance comparison to understand why the two kernels behave so differently.
💡 Success tip: Pay attention to how the different grid configurations (1D vs 2D) affect memory access patterns; this insight applies to many GPU programming scenarios beyond embeddings.