Generate Embeddings with MAX Serve

Embeddings are a crucial component of intelligent agents, enabling efficient search and retrieval of proprietary information. MAX supports creating embeddings through an OpenAI-compatible API, including the ability to run the popular sentence-transformers/all-mpnet-base-v2 model from Hugging Face. When you run MPNet on MAX, you'll be serving a high-performance implementation of the model built by Modular engineers with the MAX Graph API.

In this recipe you will:

  • Run an OpenAI-compatible embeddings endpoint on MAX Serve with Docker
  • Generate embeddings with MPNet using the OpenAI Python client

About MPNet

MPNet works by encoding not only tokens (words and parts of words) but also positional data about where those tokens appear in a sentence. Upon its publication in 2020, MPNet met or exceeded the capabilities of its popular predecessors, BERT and XLNet. Today, it is one of the most popular open-source models for generating embeddings.
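If you're curious what MPNet's output looks like outside of MAX, here's a minimal sketch that loads the same checkpoint locally with the sentence-transformers package (an assumption; it is not part of this recipe) and prints the shape of the resulting vectors:

from sentence_transformers import SentenceTransformer

# Load the same MPNet checkpoint this recipe serves with MAX.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# encode() returns one 768-dimensional vector per input sentence.
embeddings = model.encode(["The bright sun shines on the old garden."])
print(embeddings.shape)  # (1, 768)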

Requirements

Please make sure your system meets our system requirements.

To proceed, ensure you have the magic CLI installed:

curl -ssL https://magic.modular.com/ | bash

or update it via:

magic self-update

A valid Hugging Face token is required to access the model. Once you have obtained a token, create your .env file:

cp .env.example .env

then add your token to .env:

HUGGING_FACE_HUB_TOKEN=

GPU requirements

To run the app on a GPU, ensure your system meets our GPU requirements.

Docker and Docker Compose are optional. Note that this recipe works on compatible Linux machines; we are actively working on enabling the MAX Serve Docker image for macOS ARM64 as well.

Quick start

  1. Download the code for this recipe using git:
git clone https://github.com/modular/max-recipes.git
cd max-recipes/max-serve-openai-embeddings
  2. Run the embedding application:
magic run app

This command is defined in the pyproject.toml file and, for convenience, invokes the max-pipelines CLI via a Procfile.
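For reference, tasks like this are typically declared in pyproject.toml under the task runner's table (magic is built on pixi, whose tasks live under [tool.pixi.tasks]). The following is a hypothetical sketch; the exact entries in this recipe's pyproject.toml may differ:

# Hypothetical sketch; the real task definitions in this recipe may differ.
[tool.pixi.tasks]
app = "honcho start"            # runs the processes listed in the Procfile
clean = "docker compose down"   # tears down the serving containers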

MAX Serve is ready once you see a line containing the following in the Docker output:

Server running on http://0.0.0.0:8000/
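Before the client code runs, you can optionally confirm the endpoint is responding by calling it directly. Here's a minimal check with curl; the request body follows the standard OpenAI embeddings schema that MAX Serve implements:

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "sentence-transformers/all-mpnet-base-v2", "input": "Hello, world!"}'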

When the embedding code in main.py runs, you should see output like this:

=== Generated embeddings with OpenAI client ===
Successfully generated embeddings!
Number of embeddings: 5
Embedding dimension: 768
First embedding, first few values: [0.36384445428848267, -0.7647817730903625, ...]
  3. Once you're done with the app, clean up resources by running:
magic run clean

Understand the code

The code for this recipe is intentionally simple — we're excited for you to start building your own project on MAX.

Open up main.py in your code editor. At the top of the file, you'll see the following:

from openai import OpenAI

MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"  #1
BASE_URL = "http://localhost:8000/v1"
API_KEY = "local"

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)  #2

def main():
    """Test embeddings using OpenAI client"""
    sentences = [  #3
        "Rice is often served in round bowls.",
        "The juice of lemons makes fine punch.",
        "The bright sun shines on the old garden.",
        "The soft breeze came across the meadow.",
        "The small pup gnawed a hole in the sock."
    ]

    try:
        response = client.embeddings.create(  #4
            model=MODEL_NAME,
            input=sentences
        )
        print("\n=== Generated embeddings with OpenAI client ===")
        print("Successfully generated embeddings!")
        print(f"Number of embeddings: {len(response.data)}")  #5
        print(f"Embedding dimension: {len(response.data[0].embedding)}")
        print(f"First embedding, first few values: {response.data[0].embedding[:5]}")
    except Exception as e:
        print(f"Error using client: {str(e)}")

if __name__ == "__main__":
    main()

Here's what the code does:

  1. Sets constants for the model name, MAX Serve URL, and API key. (Note: You can use any value for API_KEY; MAX Serve does not use it, but the OpenAI client requires a non-empty value.)
  2. Initializes the OpenAI client.
  3. Defines a list of sample sentences. (The samples here are taken from the Harvard Sentences.)
  4. Uses the OpenAI client with MAX Serve to generate the embeddings.
  5. Accesses the embeddings the OpenAI client returns.

Note how the code is a drop-in replacement for the proprietary OpenAI API; this is a key advantage of building with MAX!
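Once you have the vectors, a common next step is to measure how semantically similar two texts are. Here's a minimal sketch that computes cosine similarity between two of the sample sentences; it assumes numpy is installed, which is not part of this recipe:

from openai import OpenAI
import numpy as np

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
response = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input=[
        "Rice is often served in round bowls.",
        "The juice of lemons makes fine punch.",
    ],
)

# Cosine similarity: values near 1.0 indicate semantically similar text.
a, b = (np.array(d.embedding) for d in response.data)
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")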

What's next?

Now that you've created embeddings with MAX Serve, you can explore more MAX features and join our developer community.
