Puzzle 18: 1D Convolution Op
From MAX Graph to PyTorch custom ops
We’re now entering Part IV of our GPU puzzle journey: PyTorch Custom Operations.
In Puzzle 15, we learned how to integrate Mojo GPU kernels with Python using MAX Graph. Now we’ll explore how to:
- Use the same Mojo kernel with PyTorch’s CustomOpLibrary
- Integrate with PyTorch’s tensor system and autograd
- Compare MAX Graph vs PyTorch approaches for custom operations
- Understand the critical pattern of explicit output tensor allocation
This transition shows how the same optimized GPU kernel can work with different Python integration approaches.
Overview
In this puzzle, we’ll take the exact same 1D convolution kernel from Puzzle 15 and integrate it with PyTorch using the CustomOpLibrary instead of MAX Graph.
The key lesson: the same Mojo kernel works unchanged; only the Python integration layer differs between the MAX Graph and PyTorch approaches.
Code to complete
To complete this puzzle, you need to fill in one line to call the custom operation:
from pathlib import Path

import torch
from max.torch import CustomOpLibrary
def conv1d_pytorch(input_tensor: torch.Tensor, kernel_tensor: torch.Tensor) -> torch.Tensor:
    """
    1D convolution using our custom PyTorch operation.

    This demonstrates the transition from MAX Graph (p15) to PyTorch
    CustomOpLibrary. Uses the EXACT same Mojo kernel, but different
    Python integration!
    """
    # Load our custom operations
    mojo_kernels = Path(__file__).parent / "op"
    ops = CustomOpLibrary(mojo_kernels)

    # Create output tensor with same shape as input
    output_tensor = torch.empty_like(input_tensor)

    # Call our custom conv1d operation with explicit output tensor
    # The Mojo signature expects: (out, input, kernel)
    conv1d = ops.conv1d[{"input_size": input_tensor.shape[0], "conv_size": kernel_tensor.shape[0]}]

    # FILL IN with 1 line of code

    return output_tensor
View full file: problems/p18/p18.py
You can run the puzzle with either:

uv run poe p18
pixi run p18
When successful, you should see output similar to:
Puzzle 18: From MAX Graph to PyTorch Custom Ops
============================================================
Input array: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
NumPy reference result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]
Testing PyTorch Custom Op (device: cuda)
----------------------------------------
PyTorch custom op result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]
✅ PyTorch custom op verification PASSED
Comparing with MAX Graph approach (like p15)
--------------------------------------------
MAX Graph result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]
✅ MAX Graph verification PASSED
✅ PyTorch and MAX Graph results MATCH
Solution
The solution requires calling the compiled custom operation with the proper arguments:
# Call our custom conv1d operation with explicit output tensor
# The Mojo signature expects: (out, input, kernel)
conv1d = ops.conv1d[{"input_size": input_tensor.shape[0], "conv_size": kernel_tensor.shape[0]}]
torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)
This solution demonstrates several critical concepts:
1. torch.compile() Integration

Wrapping the specialized op with torch.compile() is what makes it callable on PyTorch tensors:

torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)
2. Explicit Output Tensor Allocation

output_tensor = torch.empty_like(input_tensor)

- Unlike MAX Graph, which handles output allocation automatically
- PyTorch CustomOpLibrary requires pre-allocated output tensors
- The Mojo operation signature expects (out, input, kernel) argument order
3. Parameter Dictionary

ops.conv1d[{"input_size": input_tensor.shape[0], "conv_size": kernel_tensor.shape[0]}]

- Parameters are passed as a dictionary when indexing the operation
- They become compile-time parameters in the Mojo kernel
- They must match the parameter names in the Mojo @staticmethod fn execute signature
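Since each parameter dictionary selects a distinct compile-time specialization, it can be worth binding and compiling an op once and reusing it across calls. Here is a minimal sketch; the caching helper is our own illustration, not part of p18.py:

```python
import torch

# Hypothetical helper (not in the puzzle code): cache one compiled wrapper
# per (input_size, conv_size) specialization so torch.compile runs only once.
_compiled_conv1d: dict[tuple[int, int], object] = {}

def get_compiled_conv1d(ops, input_size: int, conv_size: int):
    key = (input_size, conv_size)
    if key not in _compiled_conv1d:
        # Indexing with the parameter dict selects the compile-time
        # specialization of the Mojo op, as in the solution above.
        op = ops.conv1d[{"input_size": input_size, "conv_size": conv_size}]
        _compiled_conv1d[key] = torch.compile(op)
    return _compiled_conv1d[key]
```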
4. Same Kernel, Different Integration

The underlying Mojo kernel (conv1d_kernel) is identical to Puzzle 15:

- Same GPU kernel code
- Same memory access patterns
- Same computational logic
- Only the Python wrapper layer changes
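For reference, the computational logic both integrations share can be sketched in plain NumPy; this reproduces the reference output shown earlier (the helper name is ours):

```python
import numpy as np

def conv1d_reference(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    # out[i] = sum_j x[i + j] * k[j], with elements past the end of x
    # treated as zero (the same convention the reference output follows).
    out = np.zeros_like(x)
    for i in range(len(x)):
        for j in range(len(k)):
            if i + j < len(x):
                out[i] += x[i + j] * k[j]
    return out

x = np.arange(15, dtype=np.float32)
k = np.arange(4, dtype=np.float32)
print(conv1d_reference(x, k))
# [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
```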
Key concepts
This puzzle illustrates several important patterns for PyTorch custom operations:
| Concept | MAX Graph (p15) | PyTorch CustomOpLibrary (p18) |
|---|---|---|
| Output Allocation | Automatic | Manual (torch.empty_like()) |
| Operation Call | ops.custom(...) | torch.compile(op)(...) |
| Parameter Passing | parameters={...} | op[{...}] |
| Device Management | Explicit device context | PyTorch tensor device |
| Memory Management | MAX Graph tensors | PyTorch tensors |
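The device-management row deserves a concrete illustration: with CustomOpLibrary, placement simply follows the PyTorch tensors you pass in. A small sketch:

```python
import torch

# Device placement follows the input tensors; no explicit device
# context is needed on the Python side.
device = "cuda" if torch.cuda.is_available() else "cpu"
input_tensor = torch.arange(15, dtype=torch.float32, device=device)
kernel_tensor = torch.arange(4, dtype=torch.float32, device=device)
output_tensor = torch.empty_like(input_tensor)  # inherits device and dtype
```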
Critical pattern: Explicit output tensor allocation
The most important difference is that PyTorch CustomOpLibrary requires explicit output tensor allocation:
# ❌ This won't work - no output tensor
result = torch.compile(conv1d)(input_tensor, kernel_tensor)
# âś… This works - pre-allocated output tensor
output_tensor = torch.empty_like(input_tensor)
torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)
This pattern ensures:
- Memory is allocated on the correct device
- Output tensor has the right shape and dtype
- The Mojo kernel can write directly to the output buffer
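Putting these pieces together, an end-to-end call looks roughly like this. This is a sketch using the op name and directory layout from the snippet above; it assumes a CUDA-capable machine and that you run from the repository root:

```python
from pathlib import Path

import torch
from max.torch import CustomOpLibrary

# Load the op package; the path assumes the repo root as working directory.
ops = CustomOpLibrary(Path("problems/p18/op"))

x = torch.arange(15, dtype=torch.float32, device="cuda")
k = torch.arange(4, dtype=torch.float32, device="cuda")
out = torch.empty_like(x)  # pre-allocate: the op writes into this buffer

conv1d = ops.conv1d[{"input_size": x.shape[0], "conv_size": k.shape[0]}]
torch.compile(conv1d)(out, x, k)
print(out.cpu().numpy())  # expected: [14. 20. ... 80. 41. 14. 0.]
```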
torch.compile() integration
torch.compile() is essential because it:
- Handles memory layout conversion between PyTorch and Mojo
- Manages device synchronization (CPU ↔ GPU)
- Optimizes tensor format conversion
- Provides proper error handling for memory operations
Note: Without torch.compile(), you might encounter std::bad_alloc errors because the raw operation can’t handle PyTorch’s tensor memory management.
Debugging custom operations
Common issues and solutions:
- Memory Allocation Errors: Always call the op through torch.compile()
- Wrong Output Shape: Ensure output tensor matches expected dimensions
- Device Mismatch: All tensors must be on the same device
- Parameter Errors: Verify parameter names match Mojo operation signature
The debug approach: Compare your PyTorch results with the MAX Graph reference implementation that runs the same kernel.
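One way to script that comparison is a small verification helper; this one is illustrative (not part of p18.py) and assumes the NumPy reference values shown earlier:

```python
import numpy as np
import torch

def check_against_reference(result: torch.Tensor, expected: np.ndarray,
                            atol: float = 1e-6) -> bool:
    # Move the GPU result to host memory before comparing.
    actual = result.cpu().numpy()
    if np.allclose(actual, expected, atol=atol):
        print("PASSED: custom op matches the reference")
        return True
    print(f"FAILED: got {actual}, expected {expected}")
    return False
```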