Puzzle 6: Blocks

Overview

Implement a kernel that adds 10 to each position of vector a and stores it in out.

Note: You have fewer threads per block than the size of a.

Blocks visualization

Key concepts

In this puzzle, you’ll learn about:

  • Processing data larger than thread block size
  • Coordinating multiple blocks of threads
  • Computing global thread positions

The key insight is understanding how blocks of threads work together to process data that’s larger than a single block’s capacity, while maintaining correct element-to-thread mapping.
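Before filling in the kernel, it can help to see the block/thread-to-element mapping on the CPU. The sketch below is plain Python (not Mojo) that simulates the launch configuration used in this puzzle, 3 blocks of 4 threads over 9 elements:

```python
# CPU sketch (Python, not Mojo) simulating how 3 blocks of 4 threads
# each compute a unique global index over a 9-element vector.
SIZE = 9
BLOCKS_PER_GRID = 3    # matches BLOCKS_PER_GRID = (3, 1) below
THREADS_PER_BLOCK = 4  # matches THREADS_PER_BLOCK = (4, 1) below

mapping = []
for block_idx in range(BLOCKS_PER_GRID):
    for thread_idx in range(THREADS_PER_BLOCK):
        i = THREADS_PER_BLOCK * block_idx + thread_idx  # global index
        if i < SIZE:                                    # bounds guard
            mapping.append((block_idx, thread_idx, i))

print(mapping)
```

Note that 3 × 4 = 12 threads are launched but only 9 (block, thread, index) triples survive the guard, which is exactly why the guard is needed.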

Code to complete

alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32


fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p06/p06.mojo

Tips
  1. Calculate global index: i = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if i < size
  3. Inside guard: output[i] = a[i] + 10.0
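The tips above amount to an elementwise add over the in-bounds indices. A plain-Python reference (a CPU sketch, not Mojo) of what the completed kernel should compute, useful for sanity-checking the expected output:

```python
# CPU reference (Python, not Mojo) for the completed kernel:
# every in-bounds thread writes a[i] + 10.0 to its own slot of out.
def add_10_reference(a):
    size = len(a)
    out = [0.0] * size
    # One iteration per GPU thread that passes the `i < size` guard.
    for i in range(size):
        out[i] = a[i] + 10.0
    return out

print(add_10_reference([float(i) for i in range(9)]))
```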

Running the code

To test your solution, run one of the following commands in your terminal (uv or pixi, depending on your setup):

uv run poe p06
pixi run p06

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])

Solution

fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + 10.0


This solution demonstrates key concepts of block-based GPU processing:

  1. Global thread indexing

    • Combines block and thread indices: block_dim.x * block_idx.x + thread_idx.x
    • Maps each thread to a unique global position
    • Example for this puzzle's 4 threads per block:
      Block 0: [0 1 2 3]
      Block 1: [4 5 6 7]
      Block 2: [8 9 10 11] (9, 10, and 11 fall outside the data and are filtered by the guard)

  2. Block coordination

    • Each block processes a contiguous chunk of data
    • Block size (4) < data size (9), so multiple blocks are required
    • Automatic work distribution across blocks:
      Data:    [0 1 2 3 4 5 6 7 8]
      Block 0: [0 1 2 3]
      Block 1:         [4 5 6 7]
      Block 2:                 [8]
      
  3. Bounds checking

    • Guard condition i < size handles the extra threads: 3 blocks × 4 threads launch 12 threads for only 9 elements, so threads with i = 9, 10, and 11 do nothing
    • Prevents out-of-bounds access when the data size isn't a multiple of the block size
    • Essential for handling the partial block at the end of the data
  4. Memory access pattern

    • Coalesced memory access: threads in a block access contiguous memory
    • Each thread processes one element: output[i] = a[i] + 10.0
    • Block-level parallelism enables efficient memory bandwidth utilization
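Point 3 above is worth verifying concretely. Assuming the same launch configuration as this puzzle (3 blocks of 4 threads over 9 elements), a short Python sketch shows exactly which global indices the guard filters out:

```python
# Python sketch: with 3 blocks of 4 threads, 12 global indices are
# generated, but only indices 0..8 pass the `i < size` guard.
size = 9
threads_per_block = 4
num_blocks = 3

all_indices = [threads_per_block * b + t
               for b in range(num_blocks)
               for t in range(threads_per_block)]
active = [i for i in all_indices if i < size]
masked = [i for i in all_indices if i >= size]

print(active)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(masked)  # [9, 10, 11]
```

Without the guard, the three masked threads would read and write past the end of both buffers.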

This pattern forms the foundation for processing large datasets that exceed the size of a single thread block.