- Getting Started
- 🔥 Introduction
- 🧭 Puzzles Usage Guide
- Part I: GPU Fundamentals
- Puzzle 1: Map
- 🔰 Raw Memory Approach
- 💡 Preview: Modern Approach with LayoutTensor
- Puzzle 2: Zip
- Puzzle 3: Guards
- Puzzle 4: 2D Map
- 🔰 Raw Memory Approach
- 📚 Learn about LayoutTensor
- 🚀 Modern 2D Operations
- Puzzle 5: Broadcast
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 6: Blocks
- Puzzle 7: 2D Blocks
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 8: Shared Memory
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Part II: 🧮 GPU Algorithms
- Puzzle 9: Pooling
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 10: Dot Product
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 11: 1D Convolution
- 🔰 Simple Version
- ⭐ Block Boundary Version
- Puzzle 12: Prefix Sum
- 🔰 Simple Version
- ⭐ Complete Version
- Puzzle 13: Axis Sum
- Puzzle 14: Matrix Multiplication (MatMul)
- 🔰 Naïve Version with Global Memory
- 📚 Learn about Roofline Model
- 🤝 Shared Memory Version
- 📐 Tiled Version
- Part III: 🐍 Interfacing with Python via MAX Graph Custom Ops
- Puzzle 15: 1D Convolution Op
- Puzzle 16: Softmax Op
- Puzzle 17: Attention Op
- 🎯 Bonus Challenges
- Part IV: 🔥 PyTorch Custom Ops Integration
- Puzzle 18: 1D Convolution Op
- Puzzle 19: Embedding Op
- 🔰 Coalesced vs Non-Coalesced Kernels
- 📊 Performance Comparison
- Puzzle 20: Kernel Fusion and Custom Backward Pass
- ⚛️ Fused vs Unfused Kernels
- ⛓️ Autograd Integration & Backward Pass
- Part V: 🌊 Mojo Functional Patterns and Benchmarking
- Puzzle 21: GPU Functional Programming Patterns
- elementwise - Basic GPU Functional Operations
- tile - Memory-Efficient Tiled Processing
- Vectorization - SIMD Control
- 🧠 GPU Threading vs SIMD Concepts
- 📊 Benchmarking in Mojo
- Part VI: ⚡ Warp-Level Programming
- Puzzle 22: Warp Fundamentals
- 🧠 Warp lanes & SIMT execution
- 🔰 warp.sum() Essentials
- 📊 When to Use Warp Programming
- Puzzle 23: Warp Communication
- ⬇️ warp.shuffle_down()
- 📢 warp.broadcast()
- Puzzle 24: Advanced Warp Patterns
- 🦋 warp.shuffle_xor() Butterfly Networks
- 🔢 warp.prefix_sum() Scan Operations
- Part VII: Advanced Memory Operations
- Puzzle 25: Memory Coalescing
- 📚 Understanding Coalesced Access
- Optimized Access Patterns
- 🔧 Troubleshooting Memory Issues
- Puzzle 26: Async Memory Operations
- Puzzle 27: Memory Fences & Atomics
- Puzzle 28: Prefetching & Caching
- Part VIII: 📊 Performance Analysis & Optimization
- Puzzle 29: GPU Profiling Basics
- Puzzle 30: Occupancy Optimization
- Puzzle 31: Bank Conflicts
- 📚 Understanding Shared Memory Banks
- Conflict-Free Patterns
- Part IX: 🚀 Advanced GPU Features
- Puzzle 32: Tensor Core Operations
- Puzzle 33: Random Number Generation
- Puzzle 34: Advanced Synchronization
- Part X: 🌐 Multi-GPU & Advanced Applications
- Puzzle 35: Multi-Stream Programming
- Puzzle 36: Multi-GPU Basics
- Puzzle 37: End-to-End Optimization Case Study
- 🎯 Advanced Bonus Challenges