Introduction to LayoutTensor
Let's take a quick break from solving puzzles to preview a powerful abstraction that will make our GPU programming journey more enjoyable: 🔥 … the LayoutTensor.
💡 This is a motivational overview of LayoutTensor's capabilities. Don't worry about understanding everything now - we'll explore each feature in depth as we progress through the puzzles.
The challenge: Growing complexity
Let's look at the challenges we've faced so far:
# Puzzle 1: Simple indexing
out[i] = a[i] + 10.0
# Puzzle 2: Multiple array management
out[i] = a[i] + b[i]
# Puzzle 3: Bounds checking
if i < size:
    out[i] = a[i] + 10.0
As dimensions grow, code becomes more complex:
# Traditional 2D indexing for row-major 2D matrix
idx = row * WIDTH + col
if row < height and col < width:
    out[idx] = a[idx] + 10.0
The solution: A peek at LayoutTensor
LayoutTensor will help us tackle these challenges with elegant solutions. Here's a glimpse of what's coming:
- Natural Indexing: Use `tensor[i, j]` instead of manual offset calculations (see the sketch after this list)
- Automatic Bounds Checking: Built-in protection against out-of-bounds access
- Flexible Memory Layouts: Support for row-major, column-major, and tiled organizations
- Performance Optimization: Efficient memory access patterns for GPU
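To make the first point concrete, here's a rough sketch (illustrative only, not taken from a specific puzzle) of how the 2D example above could look with LayoutTensor indexing. The kernel name `add_10_2d` and its parameters are assumptions for this preview; the explicit bounds check against the logical shape stays:

from gpu import thread_idx
from layout import Layout, LayoutTensor

# Illustrative sketch: the "add 10" kernel with natural (row, col) indexing
# instead of a manually computed linear offset.
fn add_10_2d[dtype: DType, layout: Layout](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    height: Int,
    width: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < height and col < width:
        output[row, col] = a[row, col] + 10.0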
A taste of what's ahead
Let's look at a few examples of what LayoutTensor can do. Don't worry about understanding all the details now - we'll cover each feature thoroughly in upcoming puzzles.
Basic usage example
from layout import Layout, LayoutTensor

# Define layout
alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
alias layout = Layout.row_major(HEIGHT, WIDTH)

# Create tensor (here `buffer` is a HEIGHT * WIDTH buffer of `dtype` allocated elsewhere)
tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())

# Access elements naturally
tensor[0, 0] = 1.0  # First element
tensor[1, 2] = 2.0  # Last element
Preview of advanced features
As we progress through the puzzles, you'll learn about:
- Shared memory optimizations
- Efficient tiling strategies
- Vectorized operations
- Hardware acceleration
- Optimized memory access patterns
For example, here's a glimpse of how different memory layouts can be declared:
# Column-major layout
layout_col = Layout.col_major(HEIGHT, WIDTH)
# Tiled layout (for better cache utilization)
layout_tiled = tensor.tiled[4, 4](HEIGHT, WIDTH)
Each layout has its advantages (see the offset sketch after this list):

- Row-major: Elements in a row are contiguous
  # [1 2 3]
  # [4 5 6] -> [1 2 3 4 5 6]
  layout_row = Layout.row_major(2, 3)

- Column-major: Elements in a column are contiguous
  # [1 2 3]
  # [4 5 6] -> [1 4 2 5 3 6]
  layout_col = Layout.col_major(2, 3)

- Tiled: Elements grouped in tiles for cache efficiency
  # [[1 2] [3 4]] in 2x2 tiles
  layout_tiled = Layout.tiled[2, 2](4, 4)
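As a rough mental model of what these layouts mean (plain Mojo, not the LayoutTensor API - the helper names below are made up for illustration), here is how a logical (row, col) index maps to a linear memory offset under the first two layouts:

# Illustrative helpers only; LayoutTensor performs this mapping for you.
fn row_major_offset(row: Int, col: Int, width: Int) -> Int:
    return row * width + col  # stepping along a row touches adjacent memory

fn col_major_offset(row: Int, col: Int, height: Int) -> Int:
    return col * height + row  # stepping down a column touches adjacent memory

For the 2 x 3 example above, element (1, 0) - the value 4 - sits at offset 3 in row-major order but at offset 1 in column-major order, which is exactly why the two flattened sequences differ.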
Advanced GPU optimizations
As you progress, you'll discover LayoutTensor's powerful features for GPU programming:
- Memory hierarchy management
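# Note: `tb` below is shorthand for a tensor-builder helper, and BM/BK/TM/TN
# are compile-time tile sizes - both are introduced in later puzzles.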
# Shared memory allocation
shared_mem = tb[dtype]().row_major[BM, BK]().shared().alloc()
# Register allocation
reg_tile = tb[dtype]().row_major[TM, TN]().local().alloc()
- Tiling strategies
# Block tiling
block_tile = tensor.tile[BM, BN](block_idx.y, block_idx.x)
# Register tiling
reg_tile = block_tile.tile[TM, TN](thread_row, thread_col)
- Memory access patterns
# Vectorized access
vec_tensor = tensor.vectorize[1, simd_width]()
# Asynchronous transfers
copy_dram_to_sram_async[thread_layout=layout](dst, src)
- Hardware acceleration
# Tensor Core operations (coming in later puzzles)
mma_op = TensorCore[dtype, out_type, Index(M, N, K)]()
result = mma_op.mma_op(a_reg, b_reg, c_reg)
💡 Looking ahead: Through these puzzles, you'll learn to:
- Optimize data access with shared memory
- Implement efficient tiling strategies
- Leverage vectorized operations
- Utilize hardware accelerators
- Master memory access patterns
Each concept builds on the last, gradually taking you from basic tensor operations to advanced GPU programming. Ready to begin? Let's start with the fundamentals!
Quick example
Let's put everything together with a simple example that demonstrates the basics of LayoutTensor:
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
alias layout = Layout.row_major(HEIGHT, WIDTH)
fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)

def main():
    ctx = DeviceContext()
    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
    ctx.synchronize()
When we run this code with either of:
uv run poe layout_tensor_intro
pixi run layout_tensor_intro
we get the following output:
Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0
Let's break down what's happening:
- We create a 2 x 3 tensor with row-major layout
- Initially, all elements are zero
- Using natural indexing, we modify a single element
- The change is reflected in our output
This simple example demonstrates key LayoutTensor benefits:
- Clean syntax for tensor creation and access
- Automatic memory layout handling
- Built-in bounds checking
- Natural multi-dimensional indexing
While this example is straightforward, the same patterns will scale to complex GPU operations in upcoming puzzles. You'll see how these basic concepts extend to:
- Multi-threaded GPU operations
- Shared memory optimizations
- Complex tiling strategies
- Hardware-accelerated computations
Ready to start your GPU programming journey with LayoutTensor? Let's dive into the puzzles!
💡 Tip: Keep this example in mind as we progress - we'll build upon these fundamental concepts to create increasingly sophisticated GPU programs.