Naïve Matrix Multiplication
Overview
Implement a kernel that multiplies square matrices \(A\) and \(B\) and stores the result in \(\text{out}\). This is the most straightforward implementation where each thread computes one element of the output matrix.
Key concepts
In this puzzle, you’ll learn about:
- 2D thread organization for matrix operations
- Global memory access patterns
- Matrix indexing in row-major layout
- Thread-to-output element mapping
The key insight is understanding how to map 2D thread indices to matrix elements and compute dot products in parallel.
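That index mapping can be sketched on the CPU in Python (an illustrative stand-in for the Mojo kernel's `thread_idx`/`block_idx` arithmetic, not the actual GPU API):

```python
# CPU sketch of the GPU thread-to-element mapping.
# Each simulated "thread" (ty, tx) in a TPB x TPB block maps to one
# candidate output element (row, col); the bounds check filters the
# threads that fall outside the SIZE x SIZE matrix.
TPB, SIZE = 3, 2

def thread_to_element(block_idx_y, block_idx_x, thread_idx_y, thread_idx_x,
                      block_dim=TPB):
    row = block_dim * block_idx_y + thread_idx_y
    col = block_dim * block_idx_x + thread_idx_x
    return row, col

# With a single 1x1 grid, the 3x3 block produces indices 0..2 in each
# dimension, but only row < SIZE and col < SIZE threads do real work.
active = []
for ty in range(TPB):
    for tx in range(TPB):
        row, col = thread_to_element(0, 0, ty, tx)
        if row < SIZE and col < SIZE:
            active.append((row, col))
```

With these parameters, exactly the four elements of the 2×2 output are covered, one per active thread; the remaining five threads in the block are masked out by the bounds check.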
Configuration
- Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
- Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
- Grid dimensions: \(1 \times 1\)
Layout configuration:
- Input A: Layout.row_major(SIZE, SIZE)
- Input B: Layout.row_major(SIZE, SIZE)
- Output: Layout.row_major(SIZE, SIZE)
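Row-major layout stores element (row, col) at flat offset row × num_cols + col. A quick Python check of that index arithmetic (illustrative only, not the `LayoutTensor` API):

```python
SIZE = 2

# Flatten a 2x2 matrix in row-major order, mirroring what
# Layout.row_major(SIZE, SIZE) describes for memory layout.
matrix = [[1.0, 2.0],
          [3.0, 4.0]]
flat = [matrix[r][c] for r in range(SIZE) for c in range(SIZE)]

def row_major_index(row, col, num_cols=SIZE):
    # (row, col) -> flat offset in row-major storage
    return row * num_cols + col

assert flat[row_major_index(1, 0)] == matrix[1][0]  # element 3.0 at offset 2
```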
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 3
alias SIZE = 2
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, TPB)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 6 lines)
View full file: problems/p14/p14.mojo
Tips
- Calculate row and col from the thread indices
- Check that both indices are within size
- Accumulate products in a local variable
- Write the final sum to the correct output position
Running the code
To test your solution, run one of the following commands in your terminal:
uv run poe p14 --naive
pixi run p14 --naive
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
Solution
fn naive_matmul[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        var acc: output.element_type = 0

        @parameter
        for k in range(size):
            acc += a[row, k] * b[k, col]

        output[row, col] = acc
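For reference, the same logic can be written as a plain Python simulation that loops over every thread in the single 3×3 block to mimic the GPU launch (hypothetical helper, not part of the puzzle code):

```python
TPB, SIZE = 3, 2

def naive_matmul_sim(a, b, size=SIZE, tpb=TPB):
    """CPU simulation of the naive kernel: one loop iteration per thread."""
    out = [[0.0] * size for _ in range(size)]
    for thread_y in range(tpb):
        for thread_x in range(tpb):
            # block_idx is (0, 0) and block_dim is tpb, so the global
            # indices reduce to the thread indices.
            row, col = thread_y, thread_x
            if row < size and col < size:  # guard threads outside the matrix
                acc = 0.0
                for k in range(size):      # dot product: row of A, column of B
                    acc += a[row][k] * b[k][col]
                out[row][col] = acc
    return out
```

Multiplying by the identity matrix returns the input unchanged, which is a quick sanity check that the dot-product indexing is right.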
The naive matrix multiplication using LayoutTensor demonstrates the basic approach:
Matrix Layout (2×2 example)
Matrix A:          Matrix B:          Output C:
[a[0,0] a[0,1]]    [b[0,0] b[0,1]]    [c[0,0] c[0,1]]
[a[1,0] a[1,1]]    [b[1,0] b[1,1]]    [c[1,0] c[1,1]]
Implementation Details:
- Thread mapping:
  row = block_dim.y * block_idx.y + thread_idx.y
  col = block_dim.x * block_idx.x + thread_idx.x
- Memory access pattern:
  - Direct 2D indexing into A: a[row, k]
  - Column-wise (strided) access into B: b[k, col]
  - Output writing: output[row, col]
- Computation flow:
  # Use var for a mutable accumulator with the tensor's element type
  var acc: output.element_type = 0

  # @parameter unrolls the loop at compile time
  @parameter
  for k in range(size):
      acc += a[row, k] * b[k, col]
Key language features:
- Variable declaration:
  - var in var acc: output.element_type = 0 declares a mutable accumulator
  - output.element_type ensures type compatibility with the output tensor
  - The accumulator is initialized to zero before any products are added
- Loop optimization:
  - The @parameter decorator unrolls the loop at compile time
  - Improves performance for small, known matrix sizes
  - Enables better instruction scheduling
Performance characteristics:
- Memory access:
  - Each thread makes 2 × SIZE global memory reads
  - One global memory write per thread
  - No data reuse between threads
- Computational efficiency:
  - Simple implementation but suboptimal performance
  - Many redundant global memory accesses
  - No use of fast shared memory
- Limitations:
  - High global memory bandwidth usage
  - Poor data locality
  - Limited scalability for large matrices
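The redundancy is easy to quantify: each of the SIZE² threads reads 2 × SIZE input elements, so the kernel issues 2 × SIZE³ global reads even though A and B together hold only 2 × SIZE² values. A small arithmetic check (illustrative counting, not a profiler measurement):

```python
SIZE = 2  # the toy size in this puzzle; the effect grows with SIZE

threads = SIZE * SIZE                    # one thread per output element
reads_per_thread = 2 * SIZE              # one row of A plus one column of B
total_reads = threads * reads_per_thread # 2 * SIZE**3 global loads
unique_elements = 2 * SIZE * SIZE        # all elements of A and B
redundancy = total_reads // unique_elements  # each element fetched SIZE times
```

Every input element is fetched SIZE times from global memory, which is exactly the reuse that a shared-memory (tiled) version eliminates.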
This naive implementation serves as a baseline for understanding matrix multiplication on GPUs, highlighting the need for optimization in memory access patterns.