1. Getting Started
2. 🔥 Introduction
3. 🧭 Puzzles Usage Guide
4. Part I: GPU Fundamentals
5. Puzzle 1: Map
    1. 🔰 Raw Memory Approach
    2. 💡 Preview: Modern Approach with LayoutTensor
6. Puzzle 2: Zip
7. Puzzle 3: Guards
8. Puzzle 4: 2D Map
    1. 🔰 Raw Memory Approach
    2. 📚 Learn about LayoutTensor
    3. 🚀 Modern 2D Operations
9. Puzzle 5: Broadcast
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
10. Puzzle 6: Blocks
11. Puzzle 7: 2D Blocks
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
12. Puzzle 8: Shared Memory
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
13. Part II: 🧮 GPU Algorithms
14. Puzzle 9: Pooling
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
15. Puzzle 10: Dot Product
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
16. Puzzle 11: 1D Convolution
    1. 🔰 Simple Version
    2. ⭐ Block Boundary Version
17. Puzzle 12: Prefix Sum
    1. 🔰 Simple Version
    2. ⭐ Complete Version
18. Puzzle 13: Axis Sum
19. Puzzle 14: Matrix Multiplication (MatMul)
    1. 🔰 Naïve Version with Global Memory
    2. 📚 Learn about the Roofline Model
    3. 🤝 Shared Memory Version
    4. 📐 Tiled Version
20. Part III: 🐍 Interfacing with Python via MAX Graph Custom Ops
21. Puzzle 15: 1D Convolution Op
22. Puzzle 16: Softmax Op
23. Puzzle 17: Attention Op
24. 🎯 Bonus Challenges
25. Part IV: 🔥 PyTorch Custom Ops Integration
26. Puzzle 18: PyTorch Custom Op Basics
27. Puzzle 19: Integration with torch.compile
28. Part V: 🌊 Mojo Functional Patterns and Benchmarking
29. Puzzle 20: GPU Functional Programming Patterns
    1. elementwise - Basic GPU Functional Operations
    2. tile - Memory-Efficient Tiled Processing
    3. Vectorization - SIMD Control
    4. 🧠 GPU Threading vs SIMD Concepts
    5. 📊 Benchmarking in Mojo
30. Part VI: ⚡ Warp-Level Programming
31. Puzzle 21: Warp Fundamentals
    1. Warp lanes & SIMT execution
    2. warp.sum() Essentials
    3. 📊 When to Use Warp Programming
32. Puzzle 22: Essential Warp Operations
    1. 🔄 warp.shuffle_down() Communication
    2. 🔀 warp.shuffle_xor() Butterfly Patterns
    3. 📡 warp.broadcast() Distribution
33. Puzzle 23: Advanced Warp Patterns
    1. 🧮 warp.prefix_sum() Scan Operations
    2. lane_group_* Sub-warp Operations
    3. Combining with Functional Patterns
34. 📋 Quick Reference: Essential Warp Operations
35. Part VII: 🧠 Advanced Memory Operations
36. Puzzle 24: Memory Coalescing
    1. 📚 Understanding Coalesced Access
    2. Optimized Access Patterns
    3. 🔧 Troubleshooting Memory Issues
37. Puzzle 25: Async Memory Operations
38. Puzzle 26: Memory Fences & Atomics
39. Puzzle 27: Prefetching & Caching
40. Part VIII: 📊 Performance Analysis & Optimization
41. Puzzle 28: GPU Profiling Basics
42. Puzzle 29: Occupancy Optimization
43. Puzzle 30: Bank Conflicts
    1. 📚 Understanding Shared Memory Banks
    2. Conflict-Free Patterns
44. Part IX: 🚀 Advanced GPU Features
45. Puzzle 31: Tensor Core Operations
46. Puzzle 32: Random Number Generation
47. Puzzle 33: Advanced Synchronization
48. Part X: 🌐 Multi-GPU & Advanced Applications
49. Puzzle 34: Multi-Stream Programming
50. Puzzle 35: Multi-GPU Basics
51. Puzzle 36: End-to-End Optimization Case Study
52. 🎯 Advanced Bonus Challenges