Puzzle 16: Softmax Op

Overview

In this puzzle, we’ll implement the softmax function as a custom MAX Graph operation. Softmax takes a vector of real numbers and normalizes it into a probability distribution.

Mathematically, the softmax function is defined as:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

Where:

  • \(x_i\) is the \(i\)-th element of the input vector
  • \(n\) is the length of the input vector

However, this direct implementation can lead to numerical overflow issues when values are large. To address this, we use a more numerically stable version:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

Our GPU implementation uses parallel reduction for both finding the maximum value and computing the sum of exponentials, making it highly efficient for large vectors.
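
For reference, a compact NumPy version of the stable formulation that the kernels below reproduce (illustrative only, not part of the puzzle code):

import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    # Shift by the maximum for numerical stability, then normalize.
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()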

Key concepts

  • Parallel reduction for efficient maximum and sum calculations
  • Numerical stability through max-subtraction technique
  • Shared memory usage for thread communication
  • Custom MAX Graph operation integration with Python
  • Thread synchronization with barriers

Configuration

  • Vector size: \(\text{SIZE} = 128\)
  • Threads per block: \(\text{TPB} = 128\)
  • Grid dimensions: \(1 \times 1\) block
  • Shared memory: Two shared variables for max and sum

Layout configuration:

  • Input tensor: Layout.row_major(SIZE)
  • Output tensor: Layout.row_major(SIZE)
  • Custom op parameters: {"input_size": input_tensor.shape[0]}

Key aspects of this puzzle include:

  1. Numerical stability: Understanding how to handle potential numerical issues
  2. Parallel reductions: Using shared memory for efficient max and sum calculations
  3. Custom op integration: Completing the Python interface for our Mojo GPU kernel
  4. Testing and verification: Ensuring our implementation matches the expected results

Our softmax custom operation will:

  • Accept NumPy arrays from Python
  • Process them efficiently on the GPU
  • Return normalized probability distributions
  • Match the results of SciPy’s softmax implementation

Code to complete

To complete this puzzle, you need to implement both the GPU and CPU kernels in the Mojo file and complete the graph definition in the Python code.

1. Implement the GPU kernel:

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
from math import exp
from utils.numerics import max_finite, min_finite


alias SIZE = 128
alias TPB = 128
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias layout = Layout.row_major(SIZE)
alias dtype = DType.float32


fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    # FILL IN (roughly 31 lines)
    ...


View full file: problems/p16/op/softmax.mojo

Tips
  1. Use shared memory for both the maximum value and sum to ensure all threads can access these values
  2. Remember to call barrier() at appropriate points to synchronize threads
  3. Implement parallel reduction by having each thread process a portion of the input array
  4. Use a tree-based reduction pattern to minimize thread divergence
  5. Handle out-of-bounds access carefully, especially for large inputs
  6. For numerical stability, calculate \(e^{x_i - max}\) instead of \(e^{x_i}\)

2. Implement the CPU kernel:

fn softmax_cpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout, MutableAnyOrigin],
    input: LayoutTensor[dtype, layout, MutableAnyOrigin],
):
    # FILL IN (roughly 10 lines)
    ...


View full file: problems/p16/op/softmax.mojo

Tips
  1. Create a sequential implementation that follows the same mathematical steps as the GPU version
  2. First find the maximum value across all inputs
  3. Then compute \(e^{x_i - max}\) for each element and accumulate the sum
  4. Finally, normalize by dividing each element by the sum
  5. Use scalar operations since we don’t have parallel threads in the CPU implementation

Test the CPU and GPU kernels

uv run poe p16-test-kernels
pixi run p16-test-kernels

When done correctly, you'll see:

Total Discovered Tests: 1

Passed : 1 (100.00%)
Failed : 0 (0.00%)
Skipped: 0 (0.00%)

3. Complete the graph definition:

from pathlib import Path
import numpy as np
from max.driver import CPU, Accelerator, Device, Tensor, accelerator_count
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import DeviceRef, Graph, TensorType, ops
from numpy.typing import NDArray
from scipy.special import softmax as scipy_softmax


def softmax(
    input: NDArray[np.float32],
    session: InferenceSession,
    device: Device,
) -> Tensor:
    dtype = DType.float32
    input_tensor = Tensor.from_numpy(input).to(device)
    mojo_kernels = Path(__file__).parent / "op"

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        # FILL IN (roughly 4 unformatted lines)
        pass

View full file: problems/p16/p16.py

Tips
  1. Use graph.inputs[0] to access the input tensor passed to the graph
  2. Call ops.custom() with the name matching your registered custom op (“softmax”)
  3. Pass the input tensor as a value to the custom operation
  4. Specify the output type to match the input shape
  5. Include the “input_size” parameter which is required by the kernel
  6. Set graph.outputs to a list containing your operation’s output tensor

You can run the puzzle with:

uv run poe p16
pixi run p16

When successful, you should see output similar to the following on both CPU and GPU:

Input shape: (128,)
First few random input values: [ 1.1810775   0.60472375  0.5718309   0.6644599  -0.08899796]
Compiling softmax graph on Device(type=cpu,id=0)
Executing softmax on Device(type=cpu,id=0)
====================================================================================================
Compiling softmax graph on Device(type=gpu,id=0)
Executing softmax on Device(type=gpu,id=0)
====================================================================================================
First few softmax results on CPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few softmax results on GPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few expected results (SciPy calculation): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
Verification passed: Custom kernel results match SciPy calculation
Sum of all probabilities on CPU: 1.0
Sum of all probabilities on GPU: 1.0

This indicates that your custom MAX Graph operation correctly implements the softmax algorithm and produces a valid probability distribution.

Solution

To solve this puzzle, we need to implement both the Mojo kernels (GPU and CPU) and the Python graph definition for our softmax custom operation. Similar to what we did in Puzzle 15, we’re creating a bridge between Python’s ecosystem and Mojo’s GPU-accelerated computing capabilities.

The softmax operation we’re implementing is mathematically defined as:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

However, to prevent numerical overflow, we use the more stable form:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

GPU kernel implementation:

fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    shared_max = tb[dtype]().row_major[TPB]().shared().alloc()
    shared_sum = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    var thread_max: Scalar[dtype] = min_finite[dtype]()
    if global_i < input_size:
        thread_max = rebind[Scalar[dtype]](input[global_i])

    shared_max[local_i] = thread_max
    barrier()

    # Parallel reduction to find max
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared_max[local_i] = max(
                shared_max[local_i], shared_max[local_i + stride]
            )
        barrier()
        stride = stride // 2

    block_max = shared_max[0]

    var exp_val: Scalar[dtype] = 0.0
    if global_i < input_size:
        exp_val = rebind[Scalar[dtype]](exp(input[global_i] - block_max))
        output[global_i] = exp_val

    shared_sum[local_i] = exp_val
    barrier()

    # Parallel reduction for sum
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared_sum[local_i] += shared_sum[local_i + stride]
        barrier()
        stride = stride // 2

    block_sum = shared_sum[0]

    # Normalize by sum
    if global_i < input_size:
        output[global_i] = output[global_i] / block_sum


Our GPU kernel implements the numerically stable softmax algorithm, using parallel reduction for both the maximum and the sum of exponentials. Let's dissect the kernel in detail:

Kernel signature and memory management

fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
)

The kernel is parameterized with:

  • Common layout parameter for both input and output tensors
  • Vector size as an Integer parameter
  • Configurable data type with float32 as default
  • Mutable output tensor for in-place computation
  • Non-mutable input tensor (mut=False)

Shared memory allocation

shared_max = tb[dtype]().row_major[TPB]().shared().alloc()
shared_sum = tb[dtype]().row_major[TPB]().shared().alloc()

The kernel allocates two shared memory buffers:

  • shared_max: For parallel maximum finding reduction
  • shared_sum: For parallel sum computation
  • Both use TPB (Threads Per Block = 128) as their size
  • Shared memory provides fast access for all threads within a block

Thread indexing

global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x

Each thread computes:

  • global_i: Its global index in the entire computation space
  • local_i: Its local index within the current thread block

This mapping ensures each thread processes exactly one input element.
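
As a quick sanity check, here is a small Python sketch (illustrative only) of how these indices resolve for this puzzle's single-block launch:

# With BLOCKS_PER_GRID = (1, 1) and TPB = 128, global and local indices coincide.
TPB = 128
block_dim_x = TPB

for block_idx_x in range(1):            # only one block in this puzzle
    for thread_idx_x in range(TPB):
        global_i = block_dim_x * block_idx_x + thread_idx_x
        local_i = thread_idx_x
        assert global_i == local_i      # 0..127, one thread per element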

Maximum-finding phase

var thread_max: Scalar[dtype] = min_finite[dtype]()
if global_i < input_size:
    thread_max = rebind[Scalar[dtype]](input[global_i])

shared_max[local_i] = thread_max
barrier()

In this phase, each thread:

  • Initializes its running maximum to the minimum finite value, so out-of-range threads cannot affect the result
  • Loads its actual input value if it maps to a valid element
  • Stores that value in shared memory for the reduction
  • Synchronizes at a barrier to ensure all writes complete before the reduction begins

Parallel max reduction

stride = TPB // 2
while stride > 0:
    if local_i < stride:
        shared_max[local_i] = max(shared_max[local_i], shared_max[local_i + stride])
    barrier()
    stride = stride // 2

This implements a parallel tree-reduction pattern:

  1. Start with stride = 64 (half of TPB)
  2. Each active thread compares two values separated by the stride
  3. Store the maximum in the lower index
  4. Synchronize all threads with a barrier
  5. Halve the stride and repeat
  6. After \(\log_2(TPB)\) steps, shared_max[0] contains the global maximum

This logarithmic reduction is significantly faster than a linear scan on large inputs.
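
To make the tree pattern concrete, here is a small sequential Python model of the same reduction (illustrative only; in the real kernel the inner loop runs as parallel threads, with barrier() in place of the comment):

# Sequential model of the tree-based max reduction over a TPB-sized buffer.
TPB = 128
shared_max = [float(i % 7) for i in range(TPB)]  # stand-in for shared memory

stride = TPB // 2
while stride > 0:
    for local_i in range(stride):                # the "active" threads
        shared_max[local_i] = max(shared_max[local_i],
                                  shared_max[local_i + stride])
    # barrier() would go here before the stride is halved
    stride //= 2

assert shared_max[0] == 6.0                      # max of the sample data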

Exponentiation with numerical stability

block_max = shared_max[0]

var exp_val: Scalar[dtype] = 0.0
if global_i < input_size:
    exp_val = rebind[Scalar[dtype]](exp(input[global_i] - block_max))
    output[global_i] = exp_val

Each thread:

  1. Reads the global maximum from shared memory
  2. Subtracts it from its input value before taking the exponential
  3. This subtraction is crucial for numerical stability - it prevents overflow
  4. The largest element's exponential becomes \(e^{0} = 1\), and all others are \(e^{\text{negative}} < 1\)
  5. Stores the intermediate result in the output buffer
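
A small NumPy demonstration (illustrative only, not part of the puzzle code) of why the subtraction matters; float32 exponentials overflow once the argument exceeds roughly 88:

import numpy as np

x = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

naive = np.exp(x)                   # overflows to [inf, inf, inf]
stable = np.exp(x - np.max(x))      # [e^-2, e^-1, e^0], all finite

print(naive / naive.sum())          # [nan nan nan]
print(stable / stable.sum())        # ~[0.0900 0.2447 0.6652], sums to 1.0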

Parallel sum reduction

shared_sum[local_i] = exp_val
barrier()

stride = TPB // 2
while stride > 0:
    if local_i < stride:
        shared_sum[local_i] += shared_sum[local_i + stride]
    barrier()
    stride = stride // 2

The second reduction phase:

  1. Stores all exponential values in shared memory
  2. Uses the same tree-based reduction pattern as for max
  3. But performs addition instead of maximum comparison
  4. After \(\log_2(TPB)\) steps, shared_sum[0] contains the total sum of all exponentials

Final normalization

block_sum = shared_sum[0]

if global_i < input_size:
    output[global_i] = output[global_i] / block_sum

Each thread:

  1. Reads the total sum from shared memory
  2. Divides its exponential value by this sum
  3. Writes the normalized probability to the output buffer
  4. This produces a valid probability distribution that sums to 1

Performance characteristics

The implementation has excellent performance characteristics:

  • Complexity: \(O(\log n)\) for both max and sum calculations vs \(O(n)\) in a sequential approach
  • Memory efficiency: Uses only \(2 \times TPB\) elements of shared memory
  • Work efficiency: Each thread performs approximately \(2 \times \log_2(n)\) operations
  • Load balancing: Each thread handles the same amount of work
  • Synchronization: Uses minimal barriers, only where necessary
  • Memory access: Coalesced global memory access pattern for optimal bandwidth
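
Concretely, for this puzzle's configuration (\(\text{TPB} = 128\)) each tree reduction finishes in

$$\log_2(128) = 7$$

iterations (strides 64, 32, 16, 8, 4, 2, 1), versus 127 pairwise operations for a purely sequential scan of the block.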

The algorithm is also numerically robust, handling potential overflow/underflow cases by applying the max-subtraction technique that maintains precision across the wide range of values common in neural network activations.

CPU fallback implementation:

fn softmax_cpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout, MutableAnyOrigin],
    input: LayoutTensor[dtype, layout, MutableAnyOrigin],
):
    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        output[i] = exp_val
        sum_exp += exp_val

    for i in range(input_size):
        output[i] = output[i] / sum_exp


Our CPU implementation provides a sequential fallback that follows the same mathematical approach but is optimized for single-threaded execution. Let's analyze each phase:
  1. Maximum Finding:

    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))
    

    We initialize with the minimum finite value and perform a linear scan through the array, keeping track of the maximum value encountered. This has \(O(n)\) complexity but works efficiently on CPU where we don’t have many cores to parallelize across.

  2. Exponential Computation and Summation:

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        output[i] = exp_val
        sum_exp += exp_val
    

    We compute \(e^{x_i - max}\) for each element, store the result in the output buffer, and accumulate the sum \(\sum_{j=1}^{n} e^{x_j - max}\) in a single pass. This approach minimizes memory operations compared to using separate loops.

  3. Normalization:

    for i in range(input_size):
        output[i] = output[i] / sum_exp
    

    Finally, we normalize each element by dividing by the sum, producing a proper probability distribution according to the softmax formula:

    $$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

The CPU implementation uses the same numerical stability technique (subtracting the maximum) but with sequential operations rather than parallel ones. It’s simpler than the GPU version since it doesn’t need to handle shared memory or thread synchronization, but it’s also less efficient for large inputs.

Both implementations are registered with MAX Graph’s custom operation system through the @compiler.register("softmax") decorator, allowing seamless execution on either device type based on availability.

Python integration:

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        input_value = graph.inputs[0]

        # The output shape is the same as the input for softmax
        # Note: the name must match the name used in `@compiler.register("softmax")` in op/softmax.mojo
        output = ops.custom(
            name="softmax",
            values=[input_value],
            out_types=[
                TensorType(
                    dtype=input_value.tensor.dtype,
                    shape=input_value.tensor.shape,
                    device=DeviceRef.from_device(device),
                )
            ],
            parameters={
                "input_size": input_tensor.shape[0],
                "dtype": dtype,
            },
        )[0].tensor
        graph.output(output)

The Python integration creates a seamless bridge between NumPy arrays and our optimized Mojo GPU kernel. The implementation consists of several key components:
  1. Graph Setup and Configuration:

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
    

    This creates a computation graph named “softmax_graph” that:

    • Defines the input tensor type with proper dtype and shape
    • Maps the tensor to the target device (CPU or GPU)
    • Loads our custom Mojo operations from the specified directory
    • The custom_extensions parameter is crucial for linking to our Mojo implementation
  2. Custom Operation Configuration:

    output = ops.custom(
        name="softmax",
        values=[input_value],
        out_types=[
            TensorType(
                dtype=input_value.tensor.dtype,
                shape=input_value.tensor.shape,
                device=DeviceRef.from_device(device),
            )
        ],
        parameters={
            "input_size": input_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor
    

    This sets up our custom operation with:

    • Name matching the @compiler.register("softmax") in our Mojo code
    • Input values passed as a list
    • Output type definition matching the input shape and type
    • Parameters required by our kernel, including the vector size and data type
    • We extract the tensor from the first returned element with [0].tensor
  3. Graph Output Definition:

    graph.output(output)
    

    This registers our operation’s result as the graph’s output.

The main script includes comprehensive testing that:

  • Generates random input data: np.random.randn(INPUT_SIZE).astype(np.float32)
  • Calculates expected results with SciPy: scipy_softmax(input_array)
  • Verifies numerical accuracy: np.testing.assert_allclose(..., rtol=1e-5)
  • Confirms the output is a valid probability distribution: np.sum(result.to_numpy())
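
A hedged sketch of how such a check could look (the exact variable names and script structure in p16.py are assumptions; only the NumPy and SciPy calls are standard):

import numpy as np
from scipy.special import softmax as scipy_softmax

INPUT_SIZE = 128  # matches SIZE in op/softmax.mojo

input_array = np.random.randn(INPUT_SIZE).astype(np.float32)
expected = scipy_softmax(input_array)

# `result` would be the max.driver.Tensor returned by the softmax() wrapper above:
# np.testing.assert_allclose(result.to_numpy(), expected, rtol=1e-5)
# assert np.isclose(np.sum(result.to_numpy()), 1.0)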

This implementation showcases the power of MAX Graph for integrating high-performance Mojo kernels with Python’s scientific computing ecosystem, providing both efficiency and ease of use.