Puzzle 15: 1D Convolution Op

Bridging to Python with MAX Graph

We’re now entering Part III of our GPU puzzle journey: Interfacing with Python via MAX Graph Custom Ops.

In previous puzzles, we’ve learned how to write efficient GPU kernels in Mojo. Now we’ll explore how to:

  • Package these kernels as custom operations that can be called from Python
  • Integrate with the MAX Graph system for accelerated machine learning
  • Bridge the gap between high-level Python APIs and low-level GPU code

This integration allows us to leverage the performance of Mojo GPU kernels while working in familiar Python environments.

Overview

In Puzzle 11, we implemented a 1D convolution kernel that runs efficiently on the GPU. Now we’ll take this kernel and transform it into a custom operation that can be called directly from Python using MAX Graph.

The 1D convolution kernel we’ll be working with is already implemented:

# Imports needed by the kernel below
from gpu import barrier, block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 15
alias BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    input: LayoutTensor[mut=True, dtype, in_layout],
    kernel: LayoutTensor[mut=True, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + conv_size - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[conv_size]().shared().alloc()
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]

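    # third: load the convolution kernel into shared memory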
    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
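        # each thread computes one output element as a dot product of the
        # kernel with its window of the shared input tile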
        var local_sum: out.element_type = 0

        @parameter
        for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The key aspects of this puzzle include:

  1. Custom op registration: Understanding how to expose Mojo functions to Python via the @compiler.register decorator
  2. Packaging custom ops: Learning how to package Mojo code for use with MAX Graph
  3. Python integration: Calling custom operations from Python through MAX Graph
  4. Cross-language data flow: Managing data types and memory between Python and GPU

This custom operation will:

  • Accept NumPy arrays as input from Python
  • Transfer this data to the GPU
  • Execute our optimized convolution kernel
  • Return the results back to Python

When you complete this puzzle, you’ll have created a seamless bridge between Python’s rich ecosystem and Mojo’s powerful GPU performance.

Code to complete

To complete this puzzle, you only need to fill in one line that calls conv1d_kernel:

import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        out: OutputTensor[rank=1],
        input: InputTensor[type = out.type, rank = out.rank],
        kernel: InputTensor[type = out.type, rank = out.rank],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        out_tensor = out.to_layout_tensor()
        input_tensor = input.to_layout_tensor()
        kernel_tensor = kernel.to_layout_tensor()
        alias in_layout = input_tensor.layout
        alias out_layout = out_tensor.layout
        alias conv_layout = kernel_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()
            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[out.type](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[out.type]]](out_tensor.ptr),
                    input_size,
                    owning=False,
                ),
                0,
            )

            # FILL ME IN with 1 line calling our conv1d_kernel

        elif target == "cpu":
            # we can fallback to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)


View full file: problems/p15/op/conv1d.mojo

You can run the puzzle with either of the following commands:

uv run poe p15
pixi run p15

When successful, you should see output similar to:

Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Verification passed: Custom kernel results match NumPy calculation

This indicates that your custom MAX Graph operation correctly implements the 1D convolution algorithm.
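
For reference, the "Expected result (NumPy calculation)" above is a zero-padded sliding dot product (a cross-correlation, so the kernel is not flipped as it would be by np.convolve). A minimal NumPy check, assuming the same 15-element input and 4-element kernel:

import numpy as np

input_array = np.arange(15, dtype=np.float32)
kernel = np.array([0, 1, 2, 3], dtype=np.float32)

# Zero-pad the tail so every output element sees a full window,
# then take the sliding dot product of the kernel with each window.
padded = np.concatenate([input_array, np.zeros(len(kernel) - 1, dtype=np.float32)])
expected = np.array([padded[i : i + len(kernel)] @ kernel for i in range(len(input_array))])
print(expected)
# [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]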

Solution

To solve this puzzle, we need to integrate our 1D convolution kernel with the MAX Graph system. The key is to properly call our kernel from the execute method in the Conv1DCustomOp struct.

The solution is:

            gpu_ctx.enqueue_function[
                conv1d_kernel[
                    in_layout, out_layout, conv_layout, input_size, conv_size
                ]
            ](
                out_tensor,
                input_tensor,
                kernel_tensor,
                grid_dim=BLOCKS_PER_GRID,
                block_dim=(TPB, 1),
            )

This single call does several important things:
  1. Calls enqueue_function on the GPU context (gpu_ctx is of type DeviceContext) to schedule our kernel execution
  2. Passes the necessary layout and size information as compile-time parameters
  3. Provides the output, input, and kernel tensors as runtime arguments
  4. Configures the execution grid with the appropriate dimensions

Let’s break down how this works in the larger context:

Python-Mojo integration flow

  1. Python side (problems/p15/p15.py):

    • Creates NumPy arrays for input and kernel
    • Calls the conv_1d() function, which wraps our operation in a MAX Graph
    • Converts NumPy arrays to MAX driver Tensors with Tensor.from_numpy(input).to(device)
    • Loads the custom operation package with custom_extensions=[mojo_kernels] (see the condensed driver-side sketch after this list)
  2. Graph building:

    • Configures a Graph named "conv_1d_graph" with the input types and custom_extensions=[mojo_kernels] so MAX Graph can find our packaged op
    • Calls ops.custom with the registered name "conv1d", the graph inputs, the output types, and the input_size, conv_size, and dtype parameters
  3. Custom op registration:

    • The @compiler.register("conv1d") decorator exposes our operation to MAX Graph (see the @compiler.register documentation)
    • The execute method parameters define the interface (inputs, outputs, context)
    • Input/output tensors are converted to LayoutTensors for use in our kernel
    • Device context manages GPU memory allocation and kernel execution
  4. Kernel execution:

    • When model.execute(…) is called, our conv1d_kernel receives the data
    • GPU thread configuration is set with grid_dim and block_dim
    • Results are transferred back to CPU with result.to(CPU())
    • NumPy verification compares our results with the expected output
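
To make this flow concrete, here is a condensed, hypothetical sketch of the driver side. The variable names and overall structure are assumptions rather than the exact contents of problems/p15/p15.py (which wraps graph construction in conv_1d()), but the Tensor.from_numpy(...).to(device), model.execute(...), and result.to(CPU()) calls mirror the steps listed above; graph construction itself is covered in the Python integration section below.

import numpy as np
from max.driver import CPU, Accelerator, Tensor, accelerator_count
from max.engine import InferenceSession

# Pick a GPU if one is available, otherwise fall back to CPU
device = CPU() if accelerator_count() == 0 else Accelerator()

input_array = np.arange(15, dtype=np.float32)
kernel_array = np.array([0, 1, 2, 3], dtype=np.float32)

# 1. NumPy arrays -> MAX driver tensors on the target device
input_tensor = Tensor.from_numpy(input_array).to(device)
kernel_tensor = Tensor.from_numpy(kernel_array).to(device)

# 2. Compile the graph containing our custom op (built as shown in the
#    "Python integration" section below) and load it into a session
session = InferenceSession(devices=[device])
model = session.load(graph)

# 3. Executing the model is what launches conv1d_kernel on the GPU
result = model.execute(input_tensor, kernel_tensor)[0]

# 4. Transfer the result back to the host for NumPy verification
print(result.to(CPU()).to_numpy())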

Key components in detail

  1. Custom Op Structure:

    @compiler.register("conv1d")
    struct Conv1DCustomOp:
        @staticmethod
        fn execute[target: StaticString, input_size: Int, conv_size: Int, dtype: DType = DType.float32](
            out: OutputTensor[rank=1],
            input: InputTensor[type = out.type, rank = out.rank],
            kernel: InputTensor[type = out.type, rank = out.rank],
            ctx: DeviceContextPtr,
        ) raises:
            # Implementation
    
    • target indicates the device type (“gpu” or “cpu”)
    • input_size and conv_size are parameters passed from Python
    • Tensor types ensure correct shape and type checking
    • The raises annotation lets the execute method propagate errors for proper error handling
  2. Tensor Conversion:

    out_tensor = out.to_layout_tensor()
    input_tensor = input.to_layout_tensor()
    kernel_tensor = kernel.to_layout_tensor()
    
    • MAX Graph tensors are converted to Mojo LayoutTensors
    • This allows our kernel to work with them directly
    • The layouts are extracted for compile-time optimization
  3. Device Context Usage:

    gpu_ctx = ctx.get_device_context()
    gpu_ctx.enqueue_memset(...)  # Zero output buffer
    gpu_ctx.enqueue_function[...](...) # Schedule kernel
    
    • Device context manages GPU resources
    • Memory operations ensure correct buffer state
    • Function enqueueing schedules our kernel for execution

This solution demonstrates the complete flow from Python data through MAX Graph to GPU execution and back, leveraging Mojo’s powerful type system and parametric functions to create efficient, type-safe, accelerated operations.

Understanding MAX Graph custom ops

Check out the custom ops tutorials in the MAX documentation for more details.

Custom op registration

The core of creating a custom operation is the @compiler.register decorator and the associated structure:

@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[...](
        out: OutputTensor[rank=1],
        input: InputTensor[type = out.type, rank = out.rank],
        kernel: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # Implementation here

Key components of the registration:

  • The name passed to the decorator ("conv1d") is what Python code will use to call this operation
  • The struct must have an execute method with the correct signature
  • OutputTensor and InputTensor types define the interface for Python data
  • DeviceContextPtr provides access to the execution environment

Packaging custom ops

Before the custom operation can be used from Python, it needs to be packaged:

mojo package op -o op.mojopkg

This command:

  1. Compiles the Mojo code into a deployable package
  2. Creates the necessary metadata for MAX Graph to understand the operation
  3. Produces a binary artifact (op.mojopkg) that can be loaded by Python

The package must be placed in a location where MAX Graph can find it, typically in a directory accessible to the Python code.

Python integration

On the Python side, here’s how the custom operation is used:

# Path to the directory containing our Mojo operations
mojo_kernels = Path(__file__).parent / "op"

# Configure our graph with the custom conv1d operation
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # Load our custom op package
) as graph:
    # Define inputs to the graph
    input_value, kernel_value = graph.inputs

    # Use our custom operation by name
    output = ops.custom(
        name="conv1d",  # Must match the name in @compiler.register
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor

The key elements are (an illustrative sketch of the elided pieces follows this list):

  1. Specifying the path to our custom operations with custom_extensions
  2. Calling ops.custom with the registered operation name
  3. Passing input values and parameters that match our operation’s signature
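
The input_types and out_types are elided above. As an illustration only (not the actual contents of problems/p15/p15.py, and note that the exact TensorType and ops.custom signatures vary across MAX releases, with newer ones also requiring a device argument), the full graph construction might look roughly like this:

from pathlib import Path

from max.dtype import DType
from max.graph import Graph, TensorType, ops

mojo_kernels = Path(__file__).parent / "op"
dtype = DType.float32

# Hypothetical shapes matching the puzzle: a 15-element input, a 4-element kernel.
# Newer MAX releases also expect a `device=...` argument here; check your version.
input_types = [
    TensorType(dtype, shape=[15]),
    TensorType(dtype, shape=[4]),
]

with Graph(
    "conv_1d_graph",
    input_types=input_types,
    custom_extensions=[mojo_kernels],
) as graph:
    input_value, kernel_value = graph.inputs
    output = ops.custom(
        name="conv1d",
        values=[input_value, kernel_value],
        out_types=[TensorType(dtype, shape=[15])],
        parameters={
            "input_size": 15,
            "conv_size": 4,
            "dtype": dtype,
        },
    )[0].tensor
    # Register the op's result as the graph output so execute() can return it
    graph.output(output)

The parameters dictionary is how the compile-time input_size, conv_size, and dtype parameters of Conv1DCustomOp.execute receive their values from Python.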