Naïve Matrix Multiplication
Overview
Implement a kernel that multiplies square matrices \(A\) and \(B\) and stores the result in \(\text{out}\). This is the most straightforward implementation where each thread computes one element of the output matrix.
Key concepts
In this puzzle, you’ll learn about:
- 2D thread organization for matrix operations
- Global memory access patterns
- Matrix indexing in row-major layout
- Thread-to-output element mapping
The key insight is understanding how to map 2D thread indices to matrix elements and compute dot products in parallel.
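That index mapping can be sketched on the CPU in Python (an illustrative stand-in for the Mojo kernel's `thread_idx`/`block_idx` arithmetic, not the actual GPU API):

```python
# CPU sketch of the GPU thread-to-element mapping.
# Each simulated "thread" (ty, tx) in a TPB x TPB block maps to one
# candidate output element (row, col); the bounds check filters the
# threads that fall outside the SIZE x SIZE matrix.
TPB, SIZE = 3, 2

def thread_to_element(block_idx_y, block_idx_x, thread_idx_y, thread_idx_x,
                      block_dim=TPB):
    row = block_dim * block_idx_y + thread_idx_y
    col = block_dim * block_idx_x + thread_idx_x
    return row, col

# With a single 1x1 grid, the 3x3 block produces indices 0..2 in each
# dimension, but only row < SIZE and col < SIZE threads do real work.
active = []
for ty in range(TPB):
    for tx in range(TPB):
        row, col = thread_to_element(0, 0, ty, tx)
        if row < SIZE and col < SIZE:
            active.append((row, col))
```

With these parameters, exactly the four elements of the 2×2 output are covered, one per active thread; the remaining five threads in the block are masked out by the bounds check.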
Configuration
- Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
- Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
- Grid dimensions: \(1 \times 1\)
Layout configuration:
- Input A: Layout.row_major(SIZE, SIZE)
- Input B: Layout.row_major(SIZE, SIZE)
- Output: Layout.row_major(SIZE, SIZE)
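Row-major layout stores element (row, col) at flat offset row × num_cols + col. A quick Python check of that index arithmetic (illustrative only, not the `LayoutTensor` API):

```python
SIZE = 2

# Flatten a 2x2 matrix in row-major order, mirroring what
# Layout.row_major(SIZE, SIZE) describes for memory layout.
matrix = [[1.0, 2.0],
          [3.0, 4.0]]
flat = [matrix[r][c] for r in range(SIZE) for c in range(SIZE)]

def row_major_index(row, col, num_cols=SIZE):
    # (row, col) -> flat offset in row-major storage
    return row * num_cols + col

assert flat[row_major_index(1, 0)] == matrix[1][0]  # element 3.0 at offset 2
```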
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 3
alias SIZE = 2
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, TPB)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 6 lines)
View full file: problems/p14/p14.mojo
Tips
- Calculate row and col from the thread indices
- Check that both indices are within size
- Accumulate products in a local variable
- Write the final sum to the correct output position
Running the code
To test your solution, run one of the following commands in your terminal:
uv run poe p14 --naive
pixi run p14 --naive
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
Solution
fn naive_matmul[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        var acc: output.element_type = 0

        @parameter
        for k in range(size):
            acc += a[row, k] * b[k, col]

        output[row, col] = acc
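For reference, the same logic can be written as a plain Python simulation that loops over every thread in the single 3×3 block to mimic the GPU launch (hypothetical helper, not part of the puzzle code):

```python
TPB, SIZE = 3, 2

def naive_matmul_sim(a, b, size=SIZE, tpb=TPB):
    """CPU simulation of the naive kernel: one loop iteration per thread."""
    out = [[0.0] * size for _ in range(size)]
    for thread_y in range(tpb):
        for thread_x in range(tpb):
            # block_idx is (0, 0) and block_dim is tpb, so the global
            # indices reduce to the thread indices.
            row, col = thread_y, thread_x
            if row < size and col < size:  # guard threads outside the matrix
                acc = 0.0
                for k in range(size):      # dot product: row of A, column of B
                    acc += a[row][k] * b[k][col]
                out[row][col] = acc
    return out
```

Multiplying by the identity matrix returns the input unchanged, which is a quick sanity check that the dot-product indexing is right.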
The naive matrix multiplication using LayoutTensor demonstrates the basic approach:
Matrix Layout (2×2 example)
Matrix A:          Matrix B:          Output C:
[a[0,0] a[0,1]]    [b[0,0] b[0,1]]    [c[0,0] c[0,1]]
[a[1,0] a[1,1]]    [b[1,0] b[1,1]]    [c[1,0] c[1,1]]
Implementation Details:
- Thread mapping:
  row = block_dim.y * block_idx.y + thread_idx.y
  col = block_dim.x * block_idx.x + thread_idx.x
- Memory access pattern:
  - Direct 2D indexing into A: a[row, k]
  - Column-wise (strided) access into B: b[k, col]
  - Output writing: output[row, col]
- Computation flow:
  # Use var for a mutable accumulator with the tensor's element type
  var acc: output.element_type = 0

  # @parameter unrolls the loop at compile time
  @parameter
  for k in range(size):
      acc += a[row, k] * b[k, col]
Key language features:
- Variable declaration:
  - var in var acc: output.element_type = 0 declares a mutable accumulator
  - output.element_type ensures type compatibility with the output tensor
  - The accumulator is initialized to zero before any products are added
- Loop optimization:
  - The @parameter decorator unrolls the loop at compile time
  - Improves performance for small, known matrix sizes
  - Enables better instruction scheduling
Performance characteristics:
- Memory access:
  - Each thread makes 2 × SIZE global memory reads
  - One global memory write per thread
  - No data reuse between threads
- Computational efficiency:
  - Simple implementation but suboptimal performance
  - Many redundant global memory accesses
  - No use of fast shared memory
- Limitations:
  - High global memory bandwidth usage
  - Poor data locality
  - Limited scalability for large matrices
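The redundancy is easy to quantify: each of the SIZE² threads reads 2 × SIZE input elements, so the kernel issues 2 × SIZE³ global reads even though A and B together hold only 2 × SIZE² values. A small arithmetic check (illustrative counting, not a profiler measurement):

```python
SIZE = 2  # the toy size in this puzzle; the effect grows with SIZE

threads = SIZE * SIZE                    # one thread per output element
reads_per_thread = 2 * SIZE              # one row of A plus one column of B
total_reads = threads * reads_per_thread # 2 * SIZE**3 global loads
unique_elements = 2 * SIZE * SIZE        # all elements of A and B
redundancy = total_reads // unique_elements  # each element fetched SIZE times
```

Every input element is fetched SIZE times from global memory, which is exactly the reuse that a shared-memory (tiled) version eliminates.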
This naive implementation serves as a baseline for understanding matrix multiplication on GPUs, highlighting the need for optimization in memory access patterns.