Block Boundary Version
Implement a kernel that computes a 1D convolution between 1D LayoutTensor `a` and 1D LayoutTensor `b` and stores the result in 1D LayoutTensor `out`.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.
Configuration
- Input array size: `SIZE_2 = 15` elements
- Kernel size: `CONV_2 = 4` elements
- Threads per block: `TPB = 8`
- Number of blocks: 2
- Shared memory: `TPB + CONV_2 - 1` elements for input
Notes:
- Extended loading: Account for boundary overlap
- Block edges: Handle data across block boundaries
- Memory layout: Efficient shared memory usage
- Synchronization: Proper thread coordination
Code to complete
```mojo
alias SIZE_2 = 15
alias CONV_2 = 4
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)
alias in_2_layout = Layout.row_major(SIZE_2)
alias out_2_layout = Layout.row_major(SIZE_2)
alias conv_2_layout = Layout.row_major(CONV_2)


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)
```
View full file: problems/p11/p11.mojo
Tips
- Use `tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()` for shared memory
- Load main data: `shared_a[local_i] = a[global_i]`
- Load boundary: `if local_i < CONV_2 - 1` handles the next block's data
- Load kernel: `shared_b[local_i] = b[local_i]`
- Sum within extended bounds: `if local_i + j < TPB + CONV_2 - 1`
Running the code
To test your solution, run one of the following commands in your terminal:

```bash
uv run poe p11 --block-boundary
```

```bash
pixi run p11 --block-boundary
```
Your output will look like this if the puzzle isn’t solved yet:
```txt
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])
```
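As a sanity check, the expected values can be reproduced with a short CPU reference in Python. The inputs `a = [0, 1, ..., 14]` and `b = [0, 1, 2, 3]` are an assumption inferred from the expected numbers above:

```python
# CPU reference for the expected output: a sliding dot product of a with b,
# treating positions past the end of a as zero.
# Inputs are assumed from the expected values: a[i] = i, b[j] = j.
SIZE_2, CONV_2 = 15, 4
a = [float(i) for i in range(SIZE_2)]
b = [float(j) for j in range(CONV_2)]

expected = [
    sum(a[i + j] * b[j] for j in range(CONV_2) if i + j < SIZE_2)
    for i in range(SIZE_2)
]
print(expected)  # [14.0, 20.0, ..., 41.0, 14.0, 0.0]
```

Note how the last three outputs (41.0, 14.0, 0.0) shrink as the window runs off the end of the array.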
Solution
```mojo
fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    output: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        # Zero the main slots past the end of the input so the last block
        # never reads uninitialized shared memory
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from
            # uninitialized memory, which is undefined behavior
            shared_a[TPB + local_i] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum
```
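To see why the halo load and zero padding produce the expected values, here is a hypothetical Python model of the kernel run block by block on the CPU (Python stands in for Mojo; the input values `a[i] = i`, `b[j] = j` are assumed from the expected output above):

```python
# CPU simulation of the block-boundary kernel: each block stages
# TPB + CONV_2 - 1 elements into a "shared" buffer, including the halo
# from the next block, then each active thread computes one output.
SIZE_2, CONV_2, TPB = 15, 4, 8
a = [float(i) for i in range(SIZE_2)]
b = [float(j) for j in range(CONV_2)]
out = [0.0] * SIZE_2

for block in range(2):
    # zero-filled buffer models the kernel's explicit else branches
    shared_a = [0.0] * (TPB + CONV_2 - 1)
    # main load: one element per thread
    for local_i in range(TPB):
        global_i = block * TPB + local_i
        if global_i < SIZE_2:
            shared_a[local_i] = a[global_i]
    # halo load: the first CONV_2 - 1 threads fetch from the next block
    for local_i in range(CONV_2 - 1):
        next_idx = block * TPB + local_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
    # (barrier would go here) compute phase
    for local_i in range(TPB):
        global_i = block * TPB + local_i
        if global_i < SIZE_2:
            out[global_i] = sum(
                shared_a[local_i + j] * b[j] for j in range(CONV_2)
            )
print(out)
```

Running this reproduces the `expected` HostBuffer exactly, including the tail values 41.0, 14.0, 0.0 contributed by the zero-filled slots.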
The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:
Memory layout and sizing
Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements
```txt
Block 0 shared memory: [0 1 2 3 4 5 6 7 | 8 9 10]    // TPB(8) main + (CONV_2-1)(3) halo
Block 1 shared memory: [8 9 10 11 12 13 14 0 | 0 0 0] // last block: zero padding past the input
```
Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements
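The size calculation above determines which global indices each block stages. A quick Python sketch (using the sizes above; `None` marks a slot that the kernel zero-fills instead of loading):

```python
# Global indices staged into shared memory by each block.
SIZE_2, CONV_2, TPB = 15, 4, 8
staged = {}
for block in range(2):
    start = block * TPB
    staged[block] = [g if g < SIZE_2 else None  # None -> zero-filled slot
                     for g in range(start, start + TPB + CONV_2 - 1)]
    print(block, staged[block])
```

Block 0 stages indices 0 through 10; block 1 stages 8 through 14 and zero-fills its last four slots.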
Implementation details
1. Shared Memory Allocation:

```mojo
# First: account for padding needed for convolution window
shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
```

This allocation pattern ensures we have enough space for both the block's data and the overlap region.
2. Data Loading Strategy:

```mojo
# Main block data
if global_i < SIZE_2:
    shared_a[local_i] = a[global_i]

# Boundary data from next block
if local_i < CONV_2 - 1:
    next_idx = global_i + TPB
    if next_idx < SIZE_2:
        shared_a[TPB + local_i] = a[next_idx]
    else:
        # Initialize out-of-bounds elements to 0 to avoid reading from
        # uninitialized memory, which is undefined behavior
        shared_a[TPB + local_i] = 0
```

- Only threads with `local_i < CONV_2 - 1` load boundary data, which prevents unnecessary thread divergence
- Maintains memory coalescing for the main data load
- Explicitly zeroes out-of-bounds elements to avoid undefined behavior
3. Kernel Loading:

```mojo
if local_i < CONV_2:
    shared_b[local_i] = b[local_i]
```

- Single load per thread
- Bounded by kernel size
4. Convolution Computation:

```mojo
if global_i < SIZE_2:
    var local_sum: output.element_type = 0

    @parameter
    for j in range(CONV_2):
        if local_i + j < TPB + CONV_2 - 1:
            local_sum += shared_a[local_i + j] * shared_b[j]

    output[global_i] = local_sum
```

- Uses `@parameter` for compile-time loop unrolling
- Proper type inference with `output.element_type`
- Extended bounds check for the overlap region
Memory access pattern analysis
Block 0 Access Pattern:

```txt
Thread 0: [0 1 2 3]   × [0 1 2 3]
Thread 1: [1 2 3 4]   × [0 1 2 3]
Thread 2: [2 3 4 5]   × [0 1 2 3]
...
Thread 7: [7 8 9 10]  × [0 1 2 3]   // Uses overlap data
```
Block 1 Access Pattern:

```txt
Thread 0: [8 9 10 11]   × [0 1 2 3]
Thread 1: [9 10 11 12]  × [0 1 2 3]
...
Thread 6: [14 0 0 0]    × [0 1 2 3]  // Zero padding at end
Thread 7: inactive                   // global_i = 15 is out of bounds
```
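The access-pattern tables can be checked with a small Python sketch that rebuilds each block's shared buffer and slices out every thread's window (input values assumed to be `a[i] = i`, as above):

```python
# Enumerate each thread's shared-memory window for both blocks.
SIZE_2, CONV_2, TPB = 15, 4, 8
a = list(range(SIZE_2))
windows = {}
for block in range(2):
    shared_a = [0] * (TPB + CONV_2 - 1)   # zero-filled, as in the kernel
    for s in range(TPB + CONV_2 - 1):
        g = block * TPB + s
        if g < SIZE_2:
            shared_a[s] = a[g]
    for local_i in range(TPB):
        global_i = block * TPB + local_i
        if global_i < SIZE_2:   # thread 7 of block 1 stays inactive
            windows[(block, local_i)] = shared_a[local_i : local_i + CONV_2]
print(windows[(0, 7)], windows[(1, 6)])
```

This confirms that block 0's thread 7 reads into the overlap region ([7, 8, 9, 10]) and block 1's thread 6 sees the zero padding ([14, 0, 0, 0]).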
Performance optimizations

1. Memory Coalescing:
   - Main data load: consecutive threads access consecutive memory
   - Boundary data: only the necessary threads participate
   - Single barrier synchronization point

2. Thread Divergence Minimization:
   - Clean separation of main and boundary loading
   - Uniform computation pattern within warps
   - Efficient bounds checking

3. Shared Memory Usage:
   - Sized exactly to cover the block's data plus the boundary overlap
   - No bank conflicts in the access pattern
   - Efficient reuse of loaded data

4. Boundary Handling:
   - Explicit zero initialization of out-of-bounds elements prevents reading uninitialized shared memory
   - Proper handling of edge cases
This implementation achieves efficient cross-block convolution while maintaining:
- Memory safety through proper bounds checking
- High performance through optimized memory access
- Clean code structure using LayoutTensor abstractions
- Minimal synchronization overhead