- Getting Started
- 🔥 Introduction
- 🧭 Puzzles Usage Guide
- Part I: GPU Fundamentals
- Puzzle 1: Map
- 🔰 Raw Memory Approach
- 💡 Preview: Modern Approach with LayoutTensor
- Puzzle 2: Zip
- Puzzle 3: Guards
- Puzzle 4: 2D Map
- 🔰 Raw Memory Approach
- 📚 Learn about LayoutTensor
- 🚀 Modern 2D Operations
- Puzzle 5: Broadcast
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 6: Blocks
- Puzzle 7: 2D Blocks
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 8: Shared Memory
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Part II: 🧮 GPU Algorithms
- Puzzle 9: Pooling
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 10: Dot Product
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 11: 1D Convolution
- 🔰 Simple Version
- ⭐ Block Boundary Version
- Puzzle 12: Prefix Sum
- 🔰 Simple Version
- ⭐ Complete Version
- Puzzle 13: Axis Sum
- Puzzle 14: Matrix Multiplication (MatMul)
- 🔰 Naïve Version with Global Memory
- 📚 Learn about Roofline Model
- 🤝 Shared Memory Version
- 📐 Tiled Version
- Part III: 🐍 Interfacing with Python via MAX Graph Custom Ops
- Puzzle 15: 1D Convolution Op
- Puzzle 16: Softmax Op
- Puzzle 17: Attention Op
- 🎯 Bonus Challenges
- Part IV: 🔥 PyTorch Custom Ops Integration
- Puzzle 18: PyTorch Custom Op Basics
- Puzzle 19: Integration with torch.compile
- Part V: 🌊 Mojo Functional Patterns and Benchmarking
- Puzzle 20: GPU Functional Programming Patterns
- elementwise - Basic GPU Functional Operations
- tile - Memory-Efficient Tiled Processing
- Vectorization - SIMD Control
- 🧠 GPU Threading vs SIMD Concepts
- 📊 Benchmarking in Mojo
- Part VI: ⚡ Warp-Level Programming
- Puzzle 21: Warp Fundamentals
- Warp lanes & SIMT execution
- warp.sum() Essentials
- 📊 When to Use Warp Programming
- Puzzle 22: Essential Warp Operations
- 🔄 warp.shuffle_down() Communication
- 🔀 warp.shuffle_xor() Butterfly Patterns
- 📡 warp.broadcast() Distribution
- Puzzle 23: Advanced Warp Patterns
- 🧮 warp.prefix_sum() Scan Operations
- lane_group_* Sub-warp Operations
- Combining with Functional Patterns
- 📋 Quick Reference: Essential Warp Operations
- Part VII: 🧠 Advanced Memory Operations
- Puzzle 24: Memory Coalescing
- 📚 Understanding Coalesced Access
- Optimized Access Patterns
- 🔧 Troubleshooting Memory Issues
- Puzzle 25: Async Memory Operations
- Puzzle 26: Memory Fences & Atomics
- Puzzle 27: Prefetching & Caching
- Part VIII: 📊 Performance Analysis & Optimization
- Puzzle 28: GPU Profiling Basics
- Puzzle 29: Occupancy Optimization
- Puzzle 30: Bank Conflicts
- 📚 Understanding Shared Memory Banks
- Conflict-Free Patterns
- Part IX: 🚀 Advanced GPU Features
- Puzzle 31: Tensor Core Operations
- Puzzle 32: Random Number Generation
- Puzzle 33: Advanced Synchronization
- Part X: 🌐 Multi-GPU & Advanced Applications
- Puzzle 34: Multi-Stream Programming
- Puzzle 35: Multi-GPU Basics
- Puzzle 36: End-to-End Optimization Case Study
- 🎯 Advanced Bonus Challenges