Puzzle 15: 1D Convolution Op
Bridging to Python with MAX Graph
We’re now entering Part III of our GPU puzzle journey: Interfacing with Python via MAX Graph Custom Ops.
In previous puzzles, we’ve learned how to write efficient GPU kernels in Mojo. Now we’ll explore how to:
- Package these kernels as custom operations that can be called from Python
- Integrate with the MAX Graph system for accelerated machine learning
- Bridge the gap between high-level Python APIs and low-level GPU code
This integration allows us to leverage the performance of Mojo GPU kernels while working in familiar Python environments.
Overview
In Puzzle 11, we implemented a 1D convolution kernel that runs efficiently on the GPU. Now we’ll take this kernel and transform it into a custom operation that can be called directly from Python using MAX Graph.
The 1D convolution kernel we’ll be working with is already implemented:
# Imports assumed from the full file (see problems/p15/op/conv1d.mojo)
from gpu import barrier, block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 15
alias BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    input: LayoutTensor[mut=True, dtype, in_layout],
    kernel: LayoutTensor[mut=True, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + conv_size - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[conv_size]().shared().alloc()
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]

    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
        var local_sum: out.element_type = 0

        @parameter
        for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum
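To make the tiling scheme concrete, here is a minimal NumPy emulation of the same idea, using the puzzle's test data. This is an illustrative sketch, not the GPU code itself: each block stages TPB + conv_size - 1 input elements (its own tile plus a halo from the next block) and the kernel, then each "thread" computes one output element from the staged tile.

import numpy as np

TPB = 15
input_size, conv_size = 15, 4
inp = np.arange(input_size, dtype=np.float32)
kern = np.arange(conv_size, dtype=np.float32)
out = np.zeros(input_size, dtype=np.float32)

for block in range(2):  # BLOCKS_PER_GRID = (2, 1)
    # "shared_a": this block's TPB inputs plus a (conv_size - 1) halo from the next block
    shared_a = np.zeros(TPB + conv_size - 1, dtype=np.float32)
    start = block * TPB
    valid = inp[start : start + TPB + conv_size - 1]
    shared_a[: valid.size] = valid
    for local_i in range(TPB):  # one loop iteration per "thread"
        global_i = start + local_i
        if global_i < input_size:  # same bounds guard as the kernel
            out[global_i] = np.dot(shared_a[local_i : local_i + conv_size], kern)

print(out)  # [14. 20. 26. ... 80. 41. 14.  0.]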
The key aspects of this puzzle include:
- Custom op registration: Understanding how to expose Mojo functions to Python via the @compiler.register decorator
- Packaging custom ops: Learning how to package Mojo code for use with MAX Graph
- Python integration: Calling custom operations from Python through MAX Graph
- Cross-language data flow: Managing data types and memory between Python and GPU
This custom operation will:
- Accept NumPy arrays as input from Python
- Transfer this data to the GPU
- Execute our optimized convolution kernel
- Return the results back to Python
When you complete this puzzle, you’ll have created a seamless bridge between Python’s rich ecosystem and Mojo’s powerful GPU performance.
Code to complete
To complete this puzzle, you only need to fill in one line that calls the conv1d_kernel:
import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        out: OutputTensor[rank=1],
        input: InputTensor[type = out.type, rank = out.rank],
        kernel: InputTensor[type = out.type, rank = out.rank],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        out_tensor = out.to_layout_tensor()
        input_tensor = input.to_layout_tensor()
        kernel_tensor = kernel.to_layout_tensor()
        alias in_layout = input_tensor.layout
        alias out_layout = out_tensor.layout
        alias conv_layout = kernel_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()

            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[out.type](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[out.type]]](out_tensor.ptr),
                    input_size,
                    owning=False,
                ),
                0,
            )

            # FILL ME IN with 1 line calling our conv1d_kernel

        elif target == "cpu":
            # we can fall back to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)
View full file: problems/p15/op/conv1d.mojo
You can run the puzzle with either of the following commands:
uv run poe p15
pixi run p15
When successful, you should see output similar to:
Input array: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]
Verification passed: Custom kernel results match NumPy calculation
This indicates that your custom MAX Graph operation correctly implements the 1D convolution algorithm.
Solution
To solve this puzzle, we need to integrate our 1D convolution kernel with the MAX Graph system. The key is to properly call our kernel from the execute method in the Conv1DCustomOp struct.
The solution is:
gpu_ctx.enqueue_function[
    conv1d_kernel[
        in_layout, out_layout, conv_layout, input_size, conv_size
    ]
](
    out_tensor,
    input_tensor,
    kernel_tensor,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=(TPB, 1),
)
This call:
- Calls enqueue_function on the GPU context (gpu_ctx is of type DeviceContext) to schedule our kernel execution
- Passes the necessary layout and size information as compile-time parameters
- Provides the output, input, and kernel tensors as runtime arguments
- Configures the execution grid with the appropriate dimensions
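As a quick sanity check of that launch geometry, here is the arithmetic in plain Python, using the constants defined alongside the kernel:

# Launch-geometry check for the values used in this puzzle.
TPB = 15
BLOCKS_PER_GRID = (2, 1)
input_size = 15

total_threads = BLOCKS_PER_GRID[0] * TPB  # 2 blocks * 15 threads = 30 threads
assert total_threads >= input_size        # enough threads to cover every output element
# Threads with global_i >= input_size simply skip work thanks to the kernel's bounds check.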
Let’s break down how this works in the larger context:
Python-Mojo integration flow
1. Python side (problems/p15/p15.py):
   - Creates NumPy arrays for input and kernel
   - Calls the conv_1d() function, which wraps our operation in MAX Graph
   - Converts NumPy arrays to MAX driver Tensors with Tensor.from_numpy(input).to(device)
   - Loads the custom operation package with custom_extensions=[mojo_kernels]

2. Graph building:
   - Defines input and output tensor types with TensorType
   - Specifies parameters for our operation via parameters={...}
   - Creates a computation graph with Graph("conv_1d_graph", ...)
   - Calls our operation using ops.custom(name="conv1d", ...)

3. Custom op registration:
   - The @compiler.register("conv1d") decorator exposes our operation to MAX Graph (see @compiler.register)
   - The execute method parameters define the interface (inputs, outputs, context)
   - Input/output tensors are converted to LayoutTensors for use in our kernel
   - The device context manages GPU memory allocation and kernel execution

4. Kernel execution:
   - When model.execute(...) is called, our conv1d_kernel receives the data
   - GPU thread configuration is set with grid_dim and block_dim
   - Results are transferred back to CPU with result.to(CPU())
   - NumPy verification compares our results with the expected output
Key Components in Detail
- Custom Op Structure:

@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        out: OutputTensor[rank=1],
        input: InputTensor[type = out.type, rank = out.rank],
        kernel: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # Implementation

  - target indicates the device type ("gpu" or "cpu")
  - input_size and conv_size are parameters passed from Python
  - Tensor types ensure correct shape and type checking
  - The method is marked raises for proper error handling
- Tensor Conversion:

out_tensor = out.to_layout_tensor()
input_tensor = input.to_layout_tensor()
kernel_tensor = kernel.to_layout_tensor()

  - MAX Graph tensors are converted to Mojo LayoutTensors
  - This allows our kernel to work with them directly
  - The layouts are extracted for compile-time optimization
- Device Context Usage:

gpu_ctx = ctx.get_device_context()
gpu_ctx.enqueue_memset(...)         # Zero output buffer
gpu_ctx.enqueue_function[...](...)  # Schedule kernel

  - The device context manages GPU resources
  - Memory operations ensure correct buffer state
  - Function enqueueing schedules our kernel for execution
This solution demonstrates the complete flow from Python data through MAX Graph to GPU execution and back, leveraging Mojo’s powerful type system and parametric functions to create efficient, type-safe, accelerated operations.
Understanding MAX Graph custom ops
Check out the following tutorials for more details:
Custom op registration
The core of creating a custom operation is the @compiler.register decorator and the associated structure:
@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[...](
        out: OutputTensor[rank=1],
        input: InputTensor[type = out.type, rank = out.rank],
        kernel: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # Implementation here
Key components of the registration:
- The name passed to the decorator ("conv1d") is what Python code will use to call this operation
- The struct must have an execute method with the correct signature
- OutputTensor and InputTensor types define the interface for Python data
- DeviceContextPtr provides access to the execution environment
Packaging custom ops
Before the custom operation can be used from Python, it needs to be packaged:
mojo package op -o op.mojopkg
This command:
- Compiles the Mojo code into a deployable package
- Creates the necessary metadata for MAX Graph to understand the operation
- Produces a binary artifact (op.mojopkg) that can be loaded by Python
The package must be placed in a location where MAX Graph can find it, typically in a directory accessible to the Python code.
Python integration
On the Python side, here’s how the custom operation is used:
# Path to the directory containing our Mojo operations
mojo_kernels = Path(__file__).parent / "op"

# Configure our graph with the custom conv1d operation
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # Load our custom op package
) as graph:
    # Define inputs to the graph
    input_value, kernel_value = graph.inputs

    # Use our custom operation by name
    output = ops.custom(
        name="conv1d",  # Must match the name in @compiler.register
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor
The key elements are:
- Specifying the path to our custom operations with custom_extensions
- Calling ops.custom with the registered operation name
- Passing input values and parameters that match our operation's signature