Learn How to Run Offline Inference with MAX Pipelines
Modern AI development doesn't require external APIs or complex infrastructure. With MAX, you can run state-of-the-art AI models using only a few lines of Python code. This recipe shows you how to pair MAX with Hugging Face, the leading platform for open-source AI models, to perform inference locally and efficiently. Whether you're a developer experimenting with AI or an enterprise running offline batch inference jobs, MAX provides a simple path to get up and running.
In this recipe, you will run offline batch inference locally with MAX, using a Llama 3.1 model from Hugging Face.
To proceed, ensure you have the magic CLI installed:
curl -ssL https://magic.modular.com/ | bash
Or update it via:
magic self-update
A valid Hugging Face token is required to access the model.
Once you have obtained the token, create your .env file:
cp .env.example .env
Then add your token to .env:
HUGGING_FACE_HUB_TOKEN=
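Inside the sample script, the token is then read from the environment. Below is a minimal sketch of that pattern, assuming the python-dotenv package is used to load .env; check main.py for how the recipe actually handles the token.

import os
from dotenv import load_dotenv  # assumption: python-dotenv is available

load_dotenv()  # read variables from .env into the process environment
hf_token = os.environ.get("HUGGING_FACE_HUB_TOKEN")
if not hf_token:
    raise RuntimeError("HUGGING_FACE_HUB_TOKEN is not set; add it to your .env file.")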
git clone https://github.com/modular/max-recipes.git
cd max-recipes/max-offline-inference
magic run app
This executes the sample script main.py, which loads the Llama 3.1 8B model and generates text from a few prompts, like so:
========== Batch 0 ==========
Response 0: Chicago Cubs
The Chicago Cubs won the World Series in 2016, ending a 108-year championship drought. The Cubs defeated the Cleveland Indians in the series, 4
Response 1: Los Angeles Dodgers
The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games. The Dodgers won the final game 3-1
========== Batch 1 ==========
Response 2: (Note: This is a placeholder for the actual winner, which will be determined by the outcome of the 2024 season.)
The winner of the World Series in 202
Response 3: (Select one)
The New York Yankees
The Los Angeles Dodgers
The Boston Red Sox
The Chicago Cubs
The Houston Astros
Other (please specify)
Let's break down the key components of the sample code.
# Imports for register_all_models, PipelineConfig, and LLM come from the MAX Python package; see main.py for the exact import paths.
# 1. Register the supported model architectures and choose a model repository.
register_all_models()
huggingface_repo_id = "modularai/Llama-3.1-8B-Instruct-GGUF"
# 2. Configure the pipeline and construct the LLM.
pipeline_config = PipelineConfig(huggingface_repo_id, max_batch_size=2)
llm = LLM(pipeline_config)
This initial block registers the supported model architectures, points at the Llama 3.1 8B Instruct repository on Hugging Face, and builds a PipelineConfig (with a maximum batch size of 2) that is used to construct the LLM.
Don't let the Modular name in the huggingface_repo_id limit you: MAX works with any PyTorch model from Hugging Face. Through the MAX Graph API, certain model architectures (like LlamaForCausalLM) receive automatic performance optimizations when run with MAX.
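For example, running a different Hugging Face model is just a matter of swapping the repo ID. The snippet below is a minimal sketch; the repo ID is a placeholder, and the PipelineConfig arguments simply mirror the ones shown above.

# Placeholder repo ID -- substitute any Hugging Face model repository you have access to.
huggingface_repo_id = "<organization>/<model-repo>"
pipeline_config = PipelineConfig(huggingface_repo_id, max_batch_size=2)
llm = LLM(pipeline_config)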
# 1. Define the prompts to complete.
prompts = [
    "The winner of the World Series in 2016 was",
]
# 2. Call generate() to create text completions.
# 3. max_new_tokens caps each completion at 35 new tokens.
responses = llm.generate(prompts, max_new_tokens=35)
The inference code defines a list of prompts, passes them to the generate(...) method to create text completions, and uses max_new_tokens to cap the length of each completion.
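To process more prompts at once, you can pass several strings in a single call. The sketch below assumes, based on the sample output shown earlier, that generate(...) returns one completion per prompt in order; the second prompt is added here only for illustration.

# Batch several prompts in one generate(...) call and print each completion.
prompts = [
    "The winner of the World Series in 2016 was",
    "The winner of the World Series in 2020 was",  # illustrative extra prompt
]
responses = llm.generate(prompts, max_new_tokens=35)
for i, response in enumerate(responses):
    print(f"Response {i}: {response}")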
The complete sample code includes additional features, such as handling the Hugging Face access token and formatting the output for display. You can find the full implementation in main.py.
Now that you've run offline inference with MAX, you can explore more features and join our developer community. To learn more about the magic CLI, check out the Magic tutorial.