Offline inference with MAX

Modern AI development doesn't require external APIs or complex infrastructure. With MAX, you can run state-of-the-art AI models using only a few lines of Python code. This recipe shows you how to pair MAX with Hugging Face, the leading platform for open-source AI models, to perform inference locally and efficiently. Whether you're a developer experimenting with AI or an enterprise running offline batch inference jobs, MAX provides a simple path to get up and running.

In this recipe you will:

  • Use MAX to run inference with models from Hugging Face
  • Generate text completions using the Llama-3.1 8B model

Requirements

To proceed, ensure you have the magic CLI installed:

curl -ssL https://magic.modular.com/ | bash

Or update it via:

magic self-update

A valid Hugging Face token is required to access the model. Once you have obtained a token, copy the example environment file:

cp .env.example .env

Then add your token to .env:

HUGGING_FACE_HUB_TOKEN=
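
The recipe's main.py reads this token for you, but if you adapt the code into your own script, one way to load it is sketched below. This is a minimal sketch that assumes the python-dotenv package, which is not necessarily what the recipe uses; Hugging Face libraries also pick up HUGGING_FACE_HUB_TOKEN directly from the environment, so exporting it in your shell works as well.

import os

from dotenv import load_dotenv  # python-dotenv; an assumption here, not required by the recipe

# Read .env from the current directory into the process environment.
load_dotenv()

hf_token = os.environ.get("HUGGING_FACE_HUB_TOKEN")
if not hf_token:
    raise RuntimeError("HUGGING_FACE_HUB_TOKEN is not set; add it to .env")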

Quick start

  1. Download the code for this recipe using git:

     git clone https://github.com/modular/max-recipes.git
     cd max-recipes/max-offline-inference

  2. Run the offline inference example:

     magic run app

This will execute the sample script main.py, which loads the Llama-3.1 8B model and generates text completions for a few prompts, like so:

========== Batch 0 ==========
Response 0:  Chicago Cubs
The Chicago Cubs won the World Series in 2016, ending a 108-year championship drought. The Cubs defeated the Cleveland Indians in the series, 4
Response 1:  Los Angeles Dodgers
The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games. The Dodgers won the final game 3-1

========== Batch 1 ==========
Response 2:  (Note: This is a placeholder for the actual winner, which will be determined by the outcome of the 2024 season.)
The winner of the World Series in 202
Response 3:  (Select one)
The New York Yankees
The Los Angeles Dodgers
The Boston Red Sox
The Chicago Cubs
The Houston Astros
Other (please specify)  ## Step

Understanding the code

Let's break down the key components of the sample code.

Configure and initialize model

#1
register_all_models()
huggingface_repo_id = "modularai/Llama-3.1-8B-Instruct-GGUF"

#2
pipeline_config = PipelineConfig(huggingface_repo_id, max_batch_size=2)
llm = LLM(pipeline_config)

This initial block:

  1. Registers available model architectures defined within MAX
  2. Configures the pipeline via PipelineConfig and initializes the LLM class

Don't let the Modular name in the huggingface_repo_id limit you: MAX works with any PyTorch model from Hugging Face. Through the MAX Graph API, certain model architectures (like LlamaForCausalLM) receive automatic performance optimizations when run with MAX.
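
For example, pointing the pipeline at a different Hugging Face model only requires changing the repo id. The sketch below reuses the same PipelineConfig and LLM calls from this recipe; the Mistral repo id is purely illustrative and is not part of this recipe (it is a gated model, so your Hugging Face token must have access to it):

# Sketch: swap in a different Hugging Face model (repo id is illustrative only).
huggingface_repo_id = "mistralai/Mistral-7B-Instruct-v0.3"
pipeline_config = PipelineConfig(huggingface_repo_id, max_batch_size=2)
llm = LLM(pipeline_config)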

Run inference with the model

#1
prompts = [
    "The winner of the World Series in 2016 was",
]

#2
responses = llm.generate(prompts, max_new_tokens=35)  #3

The inference code:

  1. Defines one or more prompts for the model
  2. Uses the generate(...) method to create text completions
  3. Limits response length with max_new_tokens
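
Putting these pieces together, a minimal sketch of batched generation looks like the following. It assumes generate() returns one text completion per prompt, in order, which matches the sample output shown earlier; the exact output formatting in main.py may differ.

# Sketch: generate completions for several prompts and print them
# (assumes generate() returns one completion string per prompt, in order).
prompts = [
    "The winner of the World Series in 2016 was",
    "The winner of the World Series in 2020 was",
]
responses = llm.generate(prompts, max_new_tokens=35)
for i, response in enumerate(responses):
    print(f"Response {i}: {response}")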

The complete sample code includes additional features, such as handling the Hugging Face access token and formatting the output for display. You can find the full implementation in main.py.

What's next?

Now that you've run offline inference with MAX, you can explore more MAX features and join our developer community.
