Puzzle 6: Blocks
Overview
Implement a kernel that adds 10 to each position of vector a and stores it in out.
Note: You have fewer threads per block than the size of a.
Key concepts
In this puzzle, you’ll learn about:
- Processing data larger than thread block size
- Coordinating multiple blocks of threads
- Computing global thread positions
The key insight is understanding how blocks of threads work together to process data that’s larger than a single block’s capacity, while maintaining correct element-to-thread mapping.
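The block/thread-to-index mapping for this puzzle's configuration (3 blocks of 4 threads covering 9 elements) can be sketched as a host-side simulation in plain Python — illustrative only, not Mojo or GPU code:

```python
# Host-side simulation: which global index does each GPU thread own?
# 3 blocks x 4 threads = 12 threads, but only 9 elements exist.
SIZE = 9
BLOCKS_PER_GRID = 3
THREADS_PER_BLOCK = 4

for block_idx in range(BLOCKS_PER_GRID):
    # Same formula as the kernel: block_dim.x * block_idx.x + thread_idx.x
    indices = [THREADS_PER_BLOCK * block_idx + t for t in range(THREADS_PER_BLOCK)]
    in_bounds = [i for i in indices if i < SIZE]
    print(f"block {block_idx}: global indices {indices}, in bounds {in_bounds}")
```

Note that block 2 owns global indices 8 through 11, but only index 8 is inside the data — the remaining threads must be masked off by a bounds check.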
Code to complete
alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32
fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)
View full file: problems/p06/p06.mojo
Tips
- Calculate the global index: i = block_dim.x * block_idx.x + thread_idx.x
- Add a guard: if i < size:
- Inside the guard: output[i] = a[i] + 10.0
Running the code
To test your solution, run one of the following commands in your terminal, depending on your package manager:
uv run poe p06
pixi run p06
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])
Solution
fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + 10.0
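As a sanity check, the solution's index arithmetic can be replayed on the host in plain Python (a simulation of the launch, not actual GPU execution):

```python
SIZE = 9
BLOCKS_PER_GRID = 3
THREADS_PER_BLOCK = 4

a = [float(i) for i in range(SIZE)]
out = [0.0] * SIZE

# Simulate every (block, thread) pair the grid would launch.
for block_idx in range(BLOCKS_PER_GRID):
    for thread_idx in range(THREADS_PER_BLOCK):
        i = THREADS_PER_BLOCK * block_idx + thread_idx
        if i < SIZE:  # same guard as the kernel
            out[i] = a[i] + 10.0

print(out)  # [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
```

The simulation reproduces the expected HostBuffer contents shown above.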
This solution demonstrates key concepts of block-based GPU processing:
- Global thread indexing
  - Combines block and thread indices: block_dim.x * block_idx.x + thread_idx.x
  - Maps each thread to a unique global position
  - Example for this puzzle's 4 threads per block:
    Block 0: [0 1 2 3] Block 1: [4 5 6 7] Block 2: [8 9 10 11]
- Block coordination
  - Each block processes a contiguous chunk of data
  - Block capacity (4 threads) < data size (9), so multiple blocks are required
  - Work is distributed automatically across blocks:
    Data: [0 1 2 3 4 5 6 7 8] → Block 0: [0 1 2 3], Block 1: [4 5 6 7], Block 2: [8]
- Bounds checking
  - The guard condition i < size prevents out-of-bounds access: the grid launches 12 threads for only 9 elements
  - Essential for handling the partial block at the end of the data; threads with i ≥ 9 simply do nothing
- Memory access pattern
  - Coalesced memory access: threads in a block access contiguous memory
  - Each thread processes one element: output[i] = a[i] + 10.0
  - Block-level parallelism enables efficient memory bandwidth utilization
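To see concretely why the guard matters, here is the same host-side Python simulation with the guard removed for one out-of-bounds thread. Python raises a clean IndexError; on a real GPU the equivalent mistake is an out-of-bounds memory access with undefined behavior, not an exception:

```python
SIZE = 9
a = [float(i) for i in range(SIZE)]
out = [0.0] * SIZE

# Thread 1 of block 2 computes i = 4 * 2 + 1 = 9, one past the end of a.
i = 9
caught = False
try:
    out[i] = a[i] + 10.0  # no `if i < size` guard
except IndexError:
    caught = True
    print("out-of-bounds access at i =", i)
```

The `if i < size:` check in the kernel is what keeps the three extra threads from ever performing this access.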
This pattern forms the foundation for processing large datasets that exceed the size of a single thread block.