Introduction to LayoutTensor

Let's take a quick break from solving puzzles to preview a powerful abstraction that will make our GPU programming journey more enjoyable: the LayoutTensor.

šŸ’” This is a motivational overview of LayoutTensor's capabilities. Don't worry about understanding everything now; we'll explore each feature in depth as we progress through the puzzles.

The challenge: Growing complexity

Let's look at the challenges we've faced so far:

# Puzzle 1: Simple indexing
out[i] = a[i] + 10.0

# Puzzle 2: Multiple array management
out[i] = a[i] + b[i]

# Puzzle 3: Bounds checking
if i < size:
    out[i] = a[i] + 10.0

As dimensions grow, code becomes more complex:

# Traditional 2D indexing for a row-major matrix
idx = row * WIDTH + col
if row < HEIGHT and col < WIDTH:
    out[idx] = a[idx] + 10.0
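
With a third dimension, the manual arithmetic compounds further. Here is a hypothetical extension of the same pattern (`d` and `DEPTH` are assumed names for the extra dimension, not from any puzzle):

# Traditional 3D indexing: flatten depth, row, and column by hand
idx = (d * HEIGHT + row) * WIDTH + col
if d < DEPTH and row < HEIGHT and col < WIDTH:
    out[idx] = a[idx] + 10.0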

The solution: A peek at LayoutTensor

LayoutTensor will help us tackle these challenges with elegant solutions. Here's a glimpse of what's coming:

  1. Natural Indexing: Use tensor[i, j] instead of manual offset calculations
  2. Automatic Bounds Checking: Built-in protection against out-of-bounds access
  3. Flexible Memory Layouts: Support for row-major, column-major, and tiled organizations
  4. Performance Optimization: Efficient memory access patterns for GPUs
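
For instance, the bounds-checked 2D addition from the previous section collapses to natural indexing. This is a sketch assuming `out` and `a` are LayoutTensors sharing the same row-major layout:

# Same guarded operation, but with natural 2D indexing
if row < HEIGHT and col < WIDTH:
    out[row, col] = a[row, col] + 10.0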

A taste of what's ahead

Let's look at a few examples of what LayoutTensor can do. Don't worry about understanding all the details now; we'll cover each feature thoroughly in upcoming puzzles.

Basic usage example

from layout import Layout, LayoutTensor

# Define the element type and layout
alias dtype = DType.float32
alias HEIGHT = 2
alias WIDTH = 3
alias layout = Layout.row_major(HEIGHT, WIDTH)

# Create a tensor view over an existing buffer (e.g. a device buffer)
tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())

# Access elements naturally
tensor[0, 0] = 1.0  # First element
tensor[1, 2] = 2.0  # Last element
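
Behind that natural indexing, the row-major layout maps tensor[i, j] to linear offset i * WIDTH + j in the underlying buffer. Here is a quick arithmetic check of that mapping (a standalone sketch, not library code):

def main():
    alias WIDTH = 3
    # Row-major offset: tensor[i, j] -> i * WIDTH + j
    print(0 * WIDTH + 0)  # tensor[0, 0] -> offset 0 (first element)
    print(1 * WIDTH + 2)  # tensor[1, 2] -> offset 5 (last element)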

Preview of advanced features

As we progress through the puzzles, you'll learn about:

  • Shared memory optimizations
  • Efficient tiling strategies
  • Vectorized operations
  • Hardware acceleration
  • Optimized memory access patterns

For example, the same data can be described by different layouts:

# Column-major layout
layout_col = Layout.col_major(HEIGHT, WIDTH)

# Tiled layout (for better cache utilization)
layout_tiled = Layout.tiled[4, 4](HEIGHT, WIDTH)

Each layout has its advantages:

  • Row-major: Elements in a row are contiguous

    # [1 2 3]
    # [4 5 6] -> [1 2 3 4 5 6]
    layout_row = Layout.row_major(2, 3)
    
  • Column-major: Elements in a column are contiguous

    # [1 2 3]
    # [4 5 6] -> [1 4 2 5 3 6]
    layout_col = Layout.col_major(2, 3)
    
  • Tiled: Elements grouped in tiles for cache efficiency

    # [[1 2] [3 4]] in 2x2 tiles
    layout_tiled = Layout.tiled[2, 2](4, 4)
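
To make the orderings concrete, here is the offset arithmetic behind the row-major and column-major layouts for the same 2 x 3 matrix (plain arithmetic only, no layout API involved):

def main():
    alias HEIGHT = 2
    alias WIDTH = 3
    for row in range(HEIGHT):
        for col in range(WIDTH):
            # Row-major offset vs. column-major offset for (row, col)
            print(row * WIDTH + col, col * HEIGHT + row)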
    

Advanced GPU optimizations

As you progress, you'll discover LayoutTensor's powerful features for GPU programming:

  1. Memory hierarchy management

# Shared memory allocation
shared_mem = tb[dtype]().row_major[BM, BK]().shared().alloc()

# Register allocation
reg_tile = tb[dtype]().row_major[TM, TN]().local().alloc()

  2. Tiling strategies (see the index sketch after this list)

# Block tiling
block_tile = tensor.tile[BM, BN](block_idx.y, block_idx.x)

# Register tiling
reg_tile = block_tile.tile[TM, TN](thread_row, thread_col)

  3. Memory access patterns

# Vectorized access
vec_tensor = tensor.vectorize[1, simd_width]()

# Asynchronous transfers
copy_dram_to_sram_async[thread_layout=layout](dst, src)

  4. Hardware acceleration

# Tensor Core operations (coming in later puzzles)
mma_op = TensorCore[dtype, out_type, Index(M, N, K)]()
result = mma_op.mma_op(a_reg, b_reg, c_reg)
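
As flagged in item 2 above, here is what the block-tiling call computes: tensor.tile[BM, BN](by, bx) selects the sub-matrix whose rows start at by * BM and whose columns start at bx * BN. Below is a standalone sketch of that index arithmetic, with made-up tile sizes and block coordinates:

def main():
    alias BM = 32
    alias BN = 32
    # Hypothetical block coordinates, as if block_idx were (y=1, x=2)
    var by = 1
    var bx = 2
    # tile[BM, BN](by, bx) would cover rows 32..63 and cols 64..95
    print("rows:", by * BM, "through", (by + 1) * BM - 1)
    print("cols:", bx * BN, "through", (bx + 1) * BN - 1)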

šŸ’” Looking ahead: Through these puzzles, you'll learn to:

  • Optimize data access with shared memory
  • Implement efficient tiling strategies
  • Leverage vectorized operations
  • Utilize hardware accelerators
  • Master memory access patterns

Each concept builds on the last, gradually taking you from basic tensor operations to advanced GPU programming. Ready to begin? Let's start with the fundamentals!

Quick example

Let's put everything together with a simple example that demonstrates the basics of LayoutTensor:

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
alias layout = Layout.row_major(HEIGHT, WIDTH)

fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)

def main():
    ctx = DeviceContext()

    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)

    ctx.synchronize()

When we run this code with either:

uv run poe layout_tensor_intro
pixi run layout_tensor_intro

we get the following output:

Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0

Let's break down what's happening:

  1. We create a 2 x 3 tensor with row-major layout
  2. Initially, all elements are zero
  3. Using natural indexing, we modify a single element
  4. The change is reflected in our output

This simple example demonstrates key LayoutTensor benefits:

  • Clean syntax for tensor creation and access
  • Automatic memory layout handling
  • Built-in bounds checking
  • Natural multi-dimensional indexing

While this example is straightforward, the same patterns will scale to complex GPU operations in upcoming puzzles. You'll see how these basic concepts extend to:

  • Multi-threaded GPU operations
  • Shared memory optimizations
  • Complex tiling strategies
  • Hardware-accelerated computations

Ready to start your GPU programming journey with LayoutTensor? Let's dive into the puzzles!

šŸ’” Tip: Keep this example in mind as we progress; we'll build upon these fundamental concepts to create increasingly sophisticated GPU programs.