Block Boundary Version
Implement a kernel that computes a 1D convolution between 1D LayoutTensor `a` and 1D LayoutTensor `b` and stores the result in 1D LayoutTensor `out`.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.
Configuration
- Input array size: `SIZE_2 = 15` elements
- Kernel size: `CONV_2 = 4` elements
- Threads per block: `TPB = 8`
- Number of blocks: 2
- Shared memory: `TPB + CONV_2 - 1` elements for input
Notes:
- Extended loading: Account for boundary overlap
- Block edges: Handle data across block boundaries
- Memory layout: Efficient shared memory usage
- Synchronization: Proper thread coordination
Code to complete
```mojo
alias SIZE_2 = 15
alias CONV_2 = 4
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)
alias in_2_layout = Layout.row_major(SIZE_2)
alias out_2_layout = Layout.row_major(SIZE_2)
alias conv_2_layout = Layout.row_major(CONV_2)


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)
```
View full file: problems/p11/p11.mojo
Tips
- Use `tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()` for shared memory
- Load main data: `shared_a[local_i] = a[global_i]`
- Load boundary: `if local_i < CONV_2 - 1` handles the next block's data
- Load kernel: `shared_b[local_i] = b[local_i]`
- Sum within extended bounds: `if local_i + j < TPB + CONV_2 - 1`
Running the code
To test your solution, run one of the following commands in your terminal:

```bash
uv run poe p11 --block-boundary
```

```bash
pixi run p11 --block-boundary
```
Your output will look like this if the puzzle isn’t solved yet:
```txt
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])
```
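As a sanity check, the expected values can be reproduced with a short CPU reference in Python. The inputs `a = [0, 1, ..., 14]` and `b = [0, 1, 2, 3]` are an assumption inferred from the expected numbers above:

```python
# CPU reference for the expected output: a sliding dot product of a with b,
# treating positions past the end of a as zero.
# Inputs are assumed from the expected values: a[i] = i, b[j] = j.
SIZE_2, CONV_2 = 15, 4
a = [float(i) for i in range(SIZE_2)]
b = [float(j) for j in range(CONV_2)]

expected = [
    sum(a[i + j] * b[j] for j in range(CONV_2) if i + j < SIZE_2)
    for i in range(SIZE_2)
]
print(expected)  # [14.0, 20.0, ..., 41.0, 14.0, 0.0]
```

Note how the last three outputs (41.0, 14.0, 0.0) shrink as the window runs off the end of the array.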
Solution
```mojo
fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    output: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        # Zero the main slots past the end of the input so the last block
        # never reads uninitialized shared memory
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from
            # uninitialized memory, which is undefined behavior
            shared_a[TPB + local_i] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum
```
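To see why the halo load and zero padding produce the expected values, here is a hypothetical Python model of the kernel run block by block on the CPU (Python stands in for Mojo; the input values `a[i] = i`, `b[j] = j` are assumed from the expected output above):

```python
# CPU simulation of the block-boundary kernel: each block stages
# TPB + CONV_2 - 1 elements into a "shared" buffer, including the halo
# from the next block, then each active thread computes one output.
SIZE_2, CONV_2, TPB = 15, 4, 8
a = [float(i) for i in range(SIZE_2)]
b = [float(j) for j in range(CONV_2)]
out = [0.0] * SIZE_2

for block in range(2):
    # zero-filled buffer models the kernel's explicit else branches
    shared_a = [0.0] * (TPB + CONV_2 - 1)
    # main load: one element per thread
    for local_i in range(TPB):
        global_i = block * TPB + local_i
        if global_i < SIZE_2:
            shared_a[local_i] = a[global_i]
    # halo load: the first CONV_2 - 1 threads fetch from the next block
    for local_i in range(CONV_2 - 1):
        next_idx = block * TPB + local_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
    # (barrier would go here) compute phase
    for local_i in range(TPB):
        global_i = block * TPB + local_i
        if global_i < SIZE_2:
            out[global_i] = sum(
                shared_a[local_i + j] * b[j] for j in range(CONV_2)
            )
print(out)
```

Running this reproduces the `expected` HostBuffer exactly, including the tail values 41.0, 14.0, 0.0 contributed by the zero-filled slots.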
The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:
Memory layout and sizing
Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements
```txt
Block 0 shared memory: [0 1 2 3 4 5 6 7 | 8 9 10]    // TPB(8) main + (CONV_2-1)(3) halo
Block 1 shared memory: [8 9 10 11 12 13 14 0 | 0 0 0] // last block: zero padding past the input
```
Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements
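The size calculation above determines which global indices each block stages. A quick Python sketch (using the sizes above; `None` marks a slot that the kernel zero-fills instead of loading):

```python
# Global indices staged into shared memory by each block.
SIZE_2, CONV_2, TPB = 15, 4, 8
staged = {}
for block in range(2):
    start = block * TPB
    staged[block] = [g if g < SIZE_2 else None  # None -> zero-filled slot
                     for g in range(start, start + TPB + CONV_2 - 1)]
    print(block, staged[block])
```

Block 0 stages indices 0 through 10; block 1 stages 8 through 14 and zero-fills its last four slots.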
Implementation details
1. Shared Memory Allocation:

```mojo
# First: account for padding needed for convolution window
shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
```

This allocation pattern ensures we have enough space for both the block's data and the overlap region.
2. Data Loading Strategy:

```mojo
# Main block data
if global_i < SIZE_2:
    shared_a[local_i] = a[global_i]

# Boundary data from next block
if local_i < CONV_2 - 1:
    next_idx = global_i + TPB
    if next_idx < SIZE_2:
        shared_a[TPB + local_i] = a[next_idx]
    else:
        # Initialize out-of-bounds elements to 0 to avoid reading from
        # uninitialized memory, which is undefined behavior
        shared_a[TPB + local_i] = 0
```

- Only threads with `local_i < CONV_2 - 1` load boundary data, which prevents unnecessary thread divergence
- Maintains memory coalescing for the main data load
- Explicitly zeroes out-of-bounds elements to avoid undefined behavior
3. Kernel Loading:

```mojo
if local_i < CONV_2:
    shared_b[local_i] = b[local_i]
```

- Single load per thread
- Bounded by kernel size
4. Convolution Computation:

```mojo
if global_i < SIZE_2:
    var local_sum: output.element_type = 0

    @parameter
    for j in range(CONV_2):
        if local_i + j < TPB + CONV_2 - 1:
            local_sum += shared_a[local_i + j] * shared_b[j]

    output[global_i] = local_sum
```

- Uses `@parameter` for compile-time loop unrolling
- Proper type inference with `output.element_type`
- Extended bounds check for the overlap region
Memory access pattern analysis
Block 0 Access Pattern:

```txt
Thread 0: [0 1 2 3]   × [0 1 2 3]
Thread 1: [1 2 3 4]   × [0 1 2 3]
Thread 2: [2 3 4 5]   × [0 1 2 3]
...
Thread 7: [7 8 9 10]  × [0 1 2 3]   // Uses overlap data
```
Block 1 Access Pattern:

```txt
Thread 0: [8 9 10 11]   × [0 1 2 3]
Thread 1: [9 10 11 12]  × [0 1 2 3]
...
Thread 6: [14 0 0 0]    × [0 1 2 3]  // Zero padding at end
Thread 7: inactive                   // global_i = 15 is out of bounds
```
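The access-pattern tables can be checked with a small Python sketch that rebuilds each block's shared buffer and slices out every thread's window (input values assumed to be `a[i] = i`, as above):

```python
# Enumerate each thread's shared-memory window for both blocks.
SIZE_2, CONV_2, TPB = 15, 4, 8
a = list(range(SIZE_2))
windows = {}
for block in range(2):
    shared_a = [0] * (TPB + CONV_2 - 1)   # zero-filled, as in the kernel
    for s in range(TPB + CONV_2 - 1):
        g = block * TPB + s
        if g < SIZE_2:
            shared_a[s] = a[g]
    for local_i in range(TPB):
        global_i = block * TPB + local_i
        if global_i < SIZE_2:   # thread 7 of block 1 stays inactive
            windows[(block, local_i)] = shared_a[local_i : local_i + CONV_2]
print(windows[(0, 7)], windows[(1, 6)])
```

This confirms that block 0's thread 7 reads into the overlap region ([7, 8, 9, 10]) and block 1's thread 6 sees the zero padding ([14, 0, 0, 0]).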
Performance optimizations

1. Memory Coalescing:
   - Main data load: consecutive threads access consecutive memory
   - Boundary data: only the necessary threads participate
   - Single barrier synchronization point

2. Thread Divergence Minimization:
   - Clean separation of main and boundary loading
   - Uniform computation pattern within warps
   - Efficient bounds checking

3. Shared Memory Usage:
   - Sized exactly to cover the block's data plus the boundary overlap
   - No bank conflicts in the access pattern
   - Efficient reuse of loaded data

4. Boundary Handling:
   - Explicit zero initialization of out-of-bounds elements prevents reading uninitialized shared memory
   - Proper handling of edge cases
This implementation achieves efficient cross-block convolution while maintaining:
- Memory safety through proper bounds checking
- High performance through optimized memory access
- Clean code structure using LayoutTensor abstractions
- Minimal synchronization overhead