Overview
Implement a kernel that adds 10 to each position of a 1D LayoutTensor `a` and stores it in a 1D LayoutTensor `output`.

Note: You have fewer threads per block than the size of `a`.
Key concepts
In this puzzle, you'll learn about:
- Using LayoutTensor's shared memory features
- Thread synchronization with shared memory
- Block-local data management with tensor builder
The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.
Configuration
- Array size: `SIZE = 8` elements
- Threads per block: `TPB = 4`
- Number of blocks: 2
- Shared memory: `TPB` elements per block
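To make the mapping concrete before diving into the kernel, here is the index arithmetic in isolation. This is a small hypothetical helper for illustration only (it is not part of the puzzle file); the kernel below computes the same value as `block_dim.x * block_idx.x + thread_idx.x`:

```mojo
# Hypothetical helper just to spell out the index math used by the kernel.
# With TPB = 4 and 2 blocks, the threads cover elements 0..7 exactly once.
fn global_index(block: Int, thread: Int) -> Int:
    return TPB * block + thread  # block_dim.x == TPB in this configuration

# Block 0: global_index(0, 0..3) -> 0, 1, 2, 3
# Block 1: global_index(1, 0..3) -> 4, 5, 6, 7
```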
Key differences from raw approach
- Memory allocation: We will use `LayoutTensorBuild` instead of `stack_allocation`

  ```mojo
  # Raw approach
  shared = stack_allocation[TPB, Scalar[dtype]]()

  # LayoutTensor approach
  shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
  ```

- Memory access: Same syntax

  ```mojo
  # Raw approach
  shared[local_i] = a[global_i]

  # LayoutTensor approach
  shared[local_i] = a[global_i]
  ```

- Safety features:
  - Type safety
  - Layout management
  - Memory alignment handling
Note: LayoutTensor handles memory layout, but you still need to manage thread synchronization with `barrier()` when using shared memory.

Educational Note: In this specific puzzle, the `barrier()` isn't strictly necessary since each thread only accesses its own shared memory location. However, it's included to teach proper shared memory synchronization patterns for more complex scenarios where threads need to coordinate access to shared data.
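To see what such a scenario looks like, here is a minimal sketch of a variant kernel where threads do read each other's shared slots, so the `barrier()` is genuinely required. This is not part of the puzzle: the kernel name and the neighbor-access pattern are made up for illustration, and it assumes the same imports and aliases as the puzzle file.

```mojo
# Hypothetical variant, for illustration only -- not part of this puzzle.
fn sum_with_neighbor_shared[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Required here: thread 0 reads shared[1], which thread 1 writes above.
    barrier()

    if global_i < size:
        # Neighbor within the block, wrapping around; assumes size is a
        # multiple of TPB (true here: 8 and 4), so every slot was written.
        next_i = (local_i + 1) % TPB
        output[global_i] = shared[local_i] + shared[next_i]
```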
Code to complete
```mojo
alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)
```
View full file: problems/p08/p08_layout_tensor.mojo
Tips
- Create shared memory with tensor builder
- Load data with natural indexing: `shared[local_i] = a[global_i]`
- Synchronize with `barrier()` (educational - not strictly needed here)
- Process data using shared memory indices
- Guard against out-of-bounds access
Running the code
To test your solution, run one of the following commands in your terminal:

```bash
uv run poe p08_layout_tensor
```

or, if you use pixi:

```bash
pixi run p08_layout_tensor
```

Your output will look like this if the puzzle isn't solved yet:

```
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])
```
Solution
```mojo
fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Note: the barrier is not strictly needed here, since each thread only
    # accesses its own shared memory location. It is included to teach proper
    # shared memory synchronization patterns for more complex scenarios where
    # threads need to coordinate access to shared data.
    barrier()

    if global_i < size:
        output[global_i] = shared[local_i] + 10
```
This solution demonstrates how LayoutTensor simplifies shared memory usage while maintaining performance:
- Memory hierarchy with LayoutTensor
  - Global tensors: `a` and `output` (slow, visible to all blocks)
  - Shared tensor: `shared` (fast, thread-block local)
  - Example for 8 elements with 4 threads per block:

    ```
    Global tensor a: [1 1 1 1 | 1 1 1 1]   # Input: all ones

    Block (0):        Block (1):
    shared[0..3]      shared[0..3]
    [1 1 1 1]         [1 1 1 1]
    ```
- Thread coordination
  - Load phase with natural indexing:

    ```
    Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
    Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
                barrier()  ↓ ↓ ↓ ↓   # Wait for all loads
    ```

  - Process phase: Each thread adds 10 to its shared tensor value
  - Result: `output[global_i] = shared[local_i] + 10 = 11`

  Note: In this specific case, the `barrier()` isn't strictly necessary since each thread only writes to and reads from its own shared memory location (`shared[local_i]`). However, it's included for educational purposes to demonstrate proper shared memory synchronization patterns that are essential when threads need to access each other's data.
- LayoutTensor benefits
  - Shared memory allocation:

    ```mojo
    # Clean tensor builder API
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    ```

  - Natural indexing for both global and shared:

    ```
    Block 0 output: [11 11 11 11]
    Block 1 output: [11 11 11 11]
    ```

  - Built-in layout management and type safety
- Memory access pattern
  - Load: Global tensor → Shared tensor (optimized)
  - Sync: Same `barrier()` requirement as raw version
  - Process: Add 10 to shared values
  - Store: Write 11s back to global tensor
This pattern shows how LayoutTensor maintains the performance benefits of shared memory while providing a more ergonomic API and built-in features.
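For context, the host side of the full file allocates the buffers and launches the kernel with the grid and block configuration above. The following is only a rough sketch of that pattern, assuming the `DeviceContext` API used elsewhere in these puzzles; exact calls may differ between Mojo versions, so treat `problems/p08/p08_layout_tensor.mojo` as the source of truth.

```mojo
from gpu.host import DeviceContext


def main():
    with DeviceContext() as ctx:
        # Device buffers: input filled with 1s, output zeroed.
        out = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(1)

        # Wrap the raw device memory in LayoutTensors with the row-major layout.
        a_tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
        out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())

        # Launch 2 blocks of TPB threads, matching the configuration above.
        ctx.enqueue_function[add_10_shared_layout_tensor[layout]](
            out_tensor,
            a_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()

        # Copy the result back and inspect it on the host.
        with out.map_to_host() as out_host:
            print("out:", out_host)
```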