Image to JSON: Multimodal Structured Output with Llama 3.2 Vision, MAX Serve and Pydantic
In this recipe, we'll walk through building a solution that showcases:
- Serving Llama 3.2 Vision locally with MAX Serve and structured output enabled
- Defining the expected output shape with Pydantic models
- Extracting structured JSON from an image through the OpenAI-compatible API
Let's get started.
Please make sure your system meets our system requirements.
To proceed, ensure you have the magic CLI installed and that magic --version reports 0.7.2 or newer:
curl -ssL https://magic.modular.com/ | bash
or update it via:
magic self-update
Then install max-pipelines
via:
magic global install max-pipelines=="25.2.0.dev2025031705"
For this recipe, you will need a Hugging Face account and an access token, since Llama 3.2 Vision is a gated model.
Set up your environment variables:
export HUGGING_FACE_HUB_TOKEN=
Structured output with MAX Serve requires GPU access. To run the app on GPU, ensure your system meets the GPU requirements.
Download the code for this recipe using the magic
CLI:
magic init max-serve-multimodal-structured-output --from modular/max-recipes/max-serve-multimodal-structured-output
cd max-serve-multimodal-structured-output
Run the server with vision model:
Make sure port 8010 is available. You can adjust the port settings in pyproject.toml.
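The server task lives in the [tool.pixi.tasks] table that magic reads from pyproject.toml. The entry below is a hypothetical sketch, not the recipe's exact configuration; the flags and port option in the actual pyproject.toml may differ:

[tool.pixi.tasks]
# Hypothetical task definition; change the port here if 8010 is taken.
server = "max-pipelines serve --model-path meta-llama/Llama-3.2-11B-Vision-Instruct --enable-structured-output --port 8010"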
magic run server
This will start MAX Serve with Llama 3.2 Vision and structured output enabled.
Run the example code that extracts player information from a basketball image via:
magic run python main.py
The output will look like:
{
  "players": [
    {
      "name": "Klay Thompson",
      "number": 11
    },
    {
      "name": "Stephen Curry",
      "number": 30
    },
    {
      "name": "Kevin Durant",
      "number": 35
    }
  ]
}
The core of our structured output system uses Pydantic models to define the expected data structure:
from typing import List

from pydantic import BaseModel, Field

class Player(BaseModel):
    name: str = Field(description="Player name on jersey")
    number: int = Field(description="Player number on jersey")

class Players(BaseModel):
    players: List[Player] = Field(description="List of players visible in the image")
How it works:
- Each Field description tells the model what to extract for that attribute.
- The OpenAI client converts the Players model into a JSON Schema and sends it along with the request.
- MAX Serve constrains token generation to that schema, so the response always parses back into Players.
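If you want to see exactly what the client derives from these models, you can print the schema Pydantic generates; a minimal sketch:

import json

# Pydantic emits the JSON Schema that constrains generation on the server side.
print(json.dumps(Players.model_json_schema(), indent=2))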
The application uses the OpenAI client to communicate with MAX Serve:
from openai import OpenAI

# MAX Serve exposes an OpenAI-compatible API; the api_key just needs to be non-empty.
# The port matches the server port configured in pyproject.toml (8010 by default).
client = OpenAI(api_key="local", base_url="http://localhost:8010/v1")

completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {"role": "system", "content": "Extract player information from the image."},
        {"role": "user", "content": [
            {
                "type": "text",
                "text": "Please provide a list of players visible in this photo with their jersey numbers.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ei.marketwatch.com/Multimedia/2019/02/15/Photos/ZH/MW-HE047_nbajer_20190215102153_ZH.jpg"
                },
            },
        ]},
    ],
    response_format=Players,
)
Key components:
- An OpenAI client pointed at the local MAX Serve endpoint (the api_key can be any string)
- A multimodal user message that combines a text instruction with an image_url part
- response_format=Players, which tells the server to constrain its output to the Players schema
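Because we call client.beta.chat.completions.parse, the SDK validates the response and hands back a Players instance directly. Continuing from the completion object above:

# The SDK parses the JSON response into our Pydantic model.
players = completion.choices[0].message.parsed

for player in players.players:
    print(f"#{player.number}: {player.name}")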
To enable structured output in MAX Serve, include the --enable-structured-output flag when starting the server.
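For example, a launch command might look like the following (the --model-path value is the model used in this recipe; check the help output below for the full set of options in your version):

max-pipelines serve \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --enable-structured-output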
To see other options, make sure to check out the help:
max-pipelines serve --help
In this recipe, we've built a system for extracting structured data from images using Llama 3.2 Vision and MAX Serve. We've explored:
- Defining the expected output shape with Pydantic models
- Sending multimodal requests through MAX Serve's OpenAI-compatible API
- Enabling schema-constrained generation with the --enable-structured-output flag
This implementation provides a foundation for building more complex vision-based applications with structured output.
We're excited to see what you'll build with Llama 3.2 Vision and MAX! Share your projects and experiences with us using #ModularAI
on social media.