Puzzle 7: 2D Blocks
Overview
Implement a kernel that adds 10 to each position of matrix a
and stores it in out
.
Note: You have fewer threads per block than the size of a
in both directions.
Key concepts
- Block-based processing
- Grid-block coordination
- Multi-block indexing
- Memory access patterns
🔑 2D thread indexing convention
We extend the block-based indexing from puzzle 04 to 2D:
Global position calculation: row = block_dim.y * block_idx.y + thread_idx.y col = block_dim.x * block_idx.x + thread_idx.x
For example, with 2×2 blocks in a 4×4 grid:
Block (0,0): Block (1,0): [0,0 0,1] [0,2 0,3] [1,0 1,1] [1,2 1,3] Block (0,1): Block (1,1): [2,0 2,1] [2,2 2,3] [3,0 3,1] [3,2 3,3]
Each position shows (row, col) for that thread’s global index. The block dimensions and indices work together to ensure:
- Continuous coverage of the 2D space
- No overlap between blocks
- Efficient memory access patterns
Implementation approaches
🔰 Raw memory approach
Learn how to handle multi-block operations with manual indexing.
📐 LayoutTensor Version
Use LayoutTensor features to elegantly handle block-based processing.
💡 Note: See how LayoutTensor simplifies block coordination and memory access patterns.