🔧 Vectorization - Fine-Grained SIMD Control

Overview

This puzzle explores advanced vectorization techniques, manual SIMD operations and Mojo’s vectorize function, that give you precise control over SIMD behavior within GPU kernels. You’ll implement two different approaches to vectorized computation:

  1. Manual vectorization: Direct SIMD control with explicit index calculations
  2. Mojo’s vectorize function: High-level vectorization with automatic bounds checking

Both approaches build on tiling concepts but with different trade-offs between control, safety, and performance optimization.

Key insight: Different vectorization strategies suit different performance requirements and complexity levels.

Key concepts

In this puzzle, you’ll master:

  • Manual SIMD operations with explicit index management
  • Mojo’s vectorize function for safe, automatic vectorization
  • Chunk-based memory organization for optimal SIMD alignment
  • Bounds checking strategies for edge cases
  • Performance trade-offs between manual control and safety

The same mathematical operation as before: \[\Large \text{out}[i] = a[i] + b[i]\]

But with sophisticated vectorization strategies for maximum performance.

Configuration

  • Vector size: SIZE = 1024
  • Tile size: TILE_SIZE = 32
  • Data type: DType.float32
  • SIMD width: GPU-dependent (see the sketch after this list)
  • Layout: Layout.row_major(SIZE) (1D row-major)
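
As a rough sketch of how these settings map to compile-time constants (the exact declarations live in problems/p20/p20.mojo; the names and the SIMD-width computation below are assumptions, not copied from that file):

from sys import simdwidthof
from layout import Layout

alias SIZE = 1024
alias TILE_SIZE = 32
alias dtype = DType.float32
# Hardware-dependent; the sample runs below report a width of 4 for float32
alias SIMD_WIDTH = simdwidthof[dtype]()
alias layout = Layout.row_major(SIZE)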

1. Manual vectorization approach

Code to complete

fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    alias chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        print("tile_id:", tile_id)
        out_tile = out.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)

        # FILL IN (7 lines at most)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


View full file: problems/p20/p20.mojo

Tips

1. Understanding chunk organization

alias chunk_size = tile_size * simd_width  # 32 * 4 = 128 elements per chunk

Each tile now contains tile_size SIMD groups (here 32 groups of 4, i.e. 128 elements), not just tile_size individual elements.

2. Global index calculation

global_start = tile_id * chunk_size + i * simd_width

This calculates the exact global position for each SIMD vector within the chunk. For example, with chunk_size = 128 and simd_width = 4, the third SIMD group (i = 2) of tile 1 starts at 1 * 128 + 2 * 4 = 136.

3. Direct tensor access

a_vec = a.load[simd_width](global_start, 0)  # Load from global tensor
out.store[simd_width](global_start, 0, ret)  # Store to global tensor

Note: Access the original tensors, not the tile views.

4. Key characteristics

  • More control, more complexity, global tensor access
  • Perfect SIMD alignment with hardware
  • Manual bounds checking required

Running manual vectorization

uv run poe p20 --manual-vectorized
pixi run p20 --manual-vectorized

Your output will look like this when not yet solved:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
tile_id: 4
tile_id: 5
tile_id: 6
tile_id: 7
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Manual vectorization solution

fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    alias chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]

        # Tile views kept for reference; the loads and stores below index the
        # original tensors directly using global offsets.
        output_tile = output.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)

        @parameter
        for i in range(tile_size):
            global_start = tile_id * chunk_size + i * simd_width

            a_vec = a.load[simd_width](global_start, 0)
            b_vec = b.load[simd_width](global_start, 0)
            ret = a_vec + b_vec
            # print("tile:", tile_id, "simd_group:", i, "global_start:", global_start, "a_vec:", a_vec, "b_vec:", b_vec, "result:", ret)

            output.store[simd_width](global_start, 0, ret)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)
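
For reference, a hypothetical host-side dispatch of this kernel might look like the sketch below; the tensor names and the value 1 for num_threads_per_tile are assumptions, since the real driver code lives in problems/p20/p20.mojo.

# Assumes out_tensor, a_tensor, b_tensor are LayoutTensors backed by device buffers
manual_vectorized_tiled_elementwise_add[
    layout, dtype, SIMD_WIDTH, 1, 1, SIZE, TILE_SIZE
](out_tensor, a_tensor, b_tensor, ctx)
ctx.synchronize()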


Manual vectorization deep dive

Manual vectorization gives you direct control over SIMD operations with explicit index calculations:

  • Chunk-based organization: chunk_size = tile_size * simd_width
  • Global indexing: Direct calculation of memory positions
  • Manual bounds management: You handle edge cases explicitly

Architecture and memory layout:

alias chunk_size = tile_size * simd_width  # 32 * 4 = 128

Chunk organization visualization (TILE_SIZE=32, SIMD_WIDTH=4):

Original array: [0, 1, 2, 3, ..., 1023]

Chunk 0 (thread 0): [0:128]    ← 128 elements = 32 SIMD groups of 4
Chunk 1 (thread 1): [128:256]  ← Next 128 elements
Chunk 2 (thread 2): [256:384]  ← Next 128 elements
...
Chunk 7 (thread 7): [896:1024] ← Final 128 elements

Processing within one chunk:

@parameter
for i in range(tile_size):  # i = 0, 1, 2, ..., 31
    global_start = tile_id * chunk_size + i * simd_width
    # For tile_id=0: global_start = 0, 4, 8, 12, ..., 124
    # For tile_id=1: global_start = 128, 132, 136, 140, ..., 252

Performance characteristics:

  • Thread count: 8 threads (1024 ÷ 128 = 8)
  • Work per thread: 128 elements (32 SIMD operations of 4 elements each)
  • Memory pattern: Large chunks with perfect SIMD alignment
  • Overhead: Minimal - direct hardware mapping
  • Safety: Manual bounds checking required

Key advantages:

  • Predictable indexing: Exact control over memory access patterns
  • Optimal alignment: SIMD operations perfectly aligned to hardware
  • Maximum throughput: No overhead from safety checks
  • Hardware optimization: Direct mapping to GPU SIMD units

Key challenges:

  • Index complexity: Manual calculation of global positions
  • Bounds responsibility: Must handle edge cases explicitly (see the sketch after this list)
  • Debugging difficulty: More complex to verify correctness
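
To make the bounds responsibility concrete, here is a minimal sketch of a guarded inner loop, assuming a size that may not divide evenly into chunks (not needed in this puzzle, where 1024 splits exactly into 128-element chunks):

@parameter
for i in range(tile_size):
    global_start = tile_id * chunk_size + i * simd_width
    # Skip full-width loads/stores that would run past the end of the tensors;
    # a trailing partial group would still need a separate scalar pass (omitted here).
    if global_start + simd_width <= size:
        a_vec = a.load[simd_width](global_start, 0)
        b_vec = b.load[simd_width](global_start, 0)
        output.store[simd_width](global_start, 0, a_vec + b_vec)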

2. Mojo vectorize approach

Code to complete

fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    out: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start
        print(
            "tile_id:",
            tile_id,
            "tile_start:",
            tile_start,
            "tile_end:",
            tile_end,
            "actual_tile_size:",
            actual_tile_size,
        )

        # FILL IN (9 lines at most)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


View full file: problems/p20/p20.mojo

Tips

1. Tile boundary calculation

tile_start = tile_id * tile_size
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

Handle cases where the last tile might be smaller than tile_size. (With SIZE = 1024 and TILE_SIZE = 32 every tile is full; a size of 1000, for example, would leave a final tile of only 8 elements.)

2. Vectorized function pattern

@parameter
fn vectorized_add[width: Int](i: Int):
    global_idx = tile_start + i
    if global_idx + width <= size:  # Bounds checking
        # SIMD operations here

The width parameter is automatically determined by the vectorize function.

3. Calling vectorize

vectorize[vectorized_add, simd_width](actual_tile_size)

This automatically handles the vectorization loop with the provided SIMD width.

4. Key characteristics

  • Automatic remainder handling, built-in safety, tile-based access
  • Takes explicit SIMD width parameter
  • Built-in bounds checking and automatic remainder element processing

Running Mojo vectorize

uv run poe p20 --vectorized
pixi run p20 --vectorized

Your output will look like this when not yet solved:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0 tile_start: 0 tile_end: 32 actual_tile_size: 32
tile_id: 1 tile_start: 32 tile_end: 64 actual_tile_size: 32
tile_id: 2 tile_start: 64 tile_end: 96 actual_tile_size: 32
tile_id: 3 tile_start: 96 tile_end: 128 actual_tile_size: 32
...
tile_id: 29 tile_start: 928 tile_end: 960 actual_tile_size: 32
tile_id: 30 tile_start: 960 tile_end: 992 actual_tile_size: 32
tile_id: 31 tile_start: 992 tile_end: 1024 actual_tile_size: 32
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Mojo vectorize solution

fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start

        @parameter
        fn vectorized_add[width: Int](i: Int):
            global_idx = tile_start + i
            if global_idx + width <= size:
                a_vec = a.load[width](global_idx, 0)
                b_vec = b.load[width](global_idx, 0)
                result = a_vec + b_vec
                output.store[width](global_idx, 0, result)

        # Use vectorize within each tile
        vectorize[vectorized_add, simd_width](actual_tile_size)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


Mojo vectorize deep dive

Mojo’s vectorize function provides automatic vectorization with built-in safety:

  • Explicit SIMD width parameter: You provide the simd_width to use
  • Built-in bounds checking: Prevents buffer overruns automatically
  • Automatic remainder handling: Processes leftover elements automatically
  • Nested function pattern: Clean separation of vectorization logic

Tile-based organization:

tile_start = tile_id * tile_size    # 0, 32, 64, 96, ...
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

Automatic vectorization mechanism:

@parameter
fn vectorized_add[width: Int](i: Int):
    global_idx = tile_start + i
    if global_idx + width <= size:
        # Automatic SIMD optimization

How vectorize works:

  • Automatic chunking: Divides actual_tile_size into chunks of your provided simd_width
  • Remainder handling: Automatically processes leftover elements with smaller widths (see the standalone sketch after this list)
  • Bounds safety: Automatically prevents buffer overruns
  • Loop management: Handles the vectorization loop automatically
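
The chunk-and-remainder behavior is easy to observe with a small standalone program (a CPU-side sketch, independent of the puzzle; the length 10 is chosen only to force a remainder):

from algorithm import vectorize

fn main():
    @parameter
    fn show[width: Int](i: Int):
        # vectorize passes the starting offset and the width it chose for this call
        print("offset:", i, "width:", width)

    # 10 elements with simd_width = 4: two full-width calls cover [0:8],
    # and the remaining 2 elements are processed automatically with narrower calls.
    vectorize[show, 4](10)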

Execution visualization (TILE_SIZE=32, SIMD_WIDTH=4):

Tile 0 processing:
  vectorize call 0: processes elements [0:4]   with SIMD_WIDTH=4
  vectorize call 1: processes elements [4:8]   with SIMD_WIDTH=4
  ...
  vectorize call 7: processes elements [28:32] with SIMD_WIDTH=4
  Total: 8 automatic SIMD operations

Performance characteristics:

  • Thread count: 32 threads (1024 ÷ 32 = 32)
  • Work per thread: 32 elements (automatic SIMD chunking)
  • Memory pattern: Smaller tiles with automatic vectorization
  • Overhead: Slight - automatic optimization and bounds checking
  • Safety: Built-in bounds checking and edge case handling

Performance comparison and best practices

When to use each approach

Choose manual vectorization when:

  • Maximum performance is critical
  • You have predictable, aligned data patterns
  • Expert-level control over memory access is needed
  • You can guarantee bounds safety manually
  • Hardware-specific optimization is required

Choose Mojo vectorize when:

  • Development speed and safety are priorities
  • Working with irregular or dynamic data sizes
  • You want automatic remainder handling instead of manual edge case management
  • Bounds checking complexity would be error-prone
  • You prefer cleaner vectorization patterns over manual loop management

Advanced optimization insights

Memory bandwidth utilization:

Manual:    8 threads × 32 SIMD ops = 256 total SIMD operations
Vectorize: 32 threads × 8 SIMD ops = 256 total SIMD operations

Both achieve similar total throughput but with different parallelism strategies.
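
In other words, both decompositions cover the same 1024 elements four lanes at a time:

\[ \underbrace{8 \times 32}_{\text{manual}} = \underbrace{32 \times 8}_{\text{vectorize}} = \frac{1024}{4} = 256 \ \text{SIMD operations} \]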

Cache behavior:

  • Manual: Large chunks may exceed L1 cache, but perfect sequential access
  • Vectorize: Smaller tiles fit better in cache, with automatic remainder handling

Hardware mapping:

  • Manual: Direct control over warp utilization and SIMD unit mapping
  • Vectorize: Simplified vectorization with automatic loop and remainder management

Best practices summary

Manual vectorization best practices:

  • Always validate index calculations carefully
  • Use compile-time constants for chunk_size when possible (see the sketch after this list)
  • Profile memory access patterns for cache optimization
  • Consider alignment requirements for optimal SIMD performance
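
For the divisibility assumption that manual indexing relies on, one option is a compile-time guard at the top of the kernel wrapper; this is a sketch, not part of the provided solution:

# Hypothetical guard: fail compilation if size cannot be split into whole chunks
alias chunk_size = tile_size * simd_width
constrained[
    size % chunk_size == 0,
    "size must be a multiple of tile_size * simd_width",
]()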

Mojo vectorize best practices:

  • Choose appropriate SIMD width for your data and hardware
  • Focus on algorithm clarity over micro-optimizations
  • Use nested parameter functions for clean vectorization logic
  • Trust automatic bounds checking and remainder handling for edge cases

Both approaches represent valid strategies in the GPU performance optimization toolkit, with manual vectorization offering maximum control and Mojo’s vectorize providing safety and automatic remainder handling.

Next steps

Now that you understand all three fundamental patterns, keep the following in mind:

💡 Key takeaway: Different vectorization strategies suit different performance requirements. Manual vectorization gives maximum control, while Mojo’s vectorize function provides safety and automatic remainder handling. Choose based on your specific performance needs and development constraints.