Build a Continuous Chat App with MAX Serve and Llama 3
In this recipe, we'll walk through building a continuous chat application that showcases Llama 3 served by MAX Serve, a rolling context window for long conversations, and a streaming Gradio chat UI.
Let's get started.
Please make sure your system meets our system requirements.
To proceed, ensure you have the `magic` CLI installed:
curl -ssL https://magic.modular.com/ | bash
or update it via:
magic self-update
You'll need:
- A Hugging Face access token (to download the Llama 3 weights).
Set up your environment variables:
cp .env.sample .env
echo "HUGGING_FACE_HUB_TOKEN=your_hf_token" > .env
For running the app on GPU, ensure your system meets these GPU requirements:
Docker and Docker Compose are optional. Note that this recipe works on compatible Linux machines. We are actively working on enabling the MAX Serve Docker image for macOS ARM64 as well.
git clone https://github.com/modular/max-recipes.git
cd max-recipes/max-serve-continuous-chat
magic run app
Once the Llama 3 server and the UI server are running, open http://localhost:7860 to view the chat interface.
When you're done, clean up the running services with:
magic run clean
If you don't have access to a supported NVIDIA GPU locally, you can instead follow our tutorials on deploying Llama 3 on GPU with MAX Serve to AWS, GCP, or Azure, or on Kubernetes, to get a public IP (serving on port 80), and then run the UI component separately as follows:
# required only once
docker buildx create --use --name mybuilder
# Intel, AMD
docker buildx bake --load --set "ui.platform=linux/amd64"
# OR for ARM such as Apple M-series
docker buildx bake --load --set "ui.platform=linux/arm64"
Then point `BASE_URL` to the available endpoint and run the UI container:
docker run -p 7860:7860 \
  -e "BASE_URL=http://<PUBLIC_IP>/v1" \
  -e "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  llama3-chat-ui
When finished, remove the builder:
docker buildx rm mybuilder
Our chat application consists of three main components: the Llama 3 model served by MAX Serve, a Gradio-based chat UI, and a rolling context window that manages conversation history.
A key feature of our chat application is the rolling context window. This mechanism ensures that conversations remain coherent and contextually relevant without overwhelming system resources. Here's an in-depth look at how this is achieved:
1. Dynamic token management
The `ChatConfig` class is responsible for tracking token usage and maintaining a rolling window of messages within the configured token limit. Tokens are the fundamental units processed by language models, and managing them efficiently is crucial for performance and cost-effectiveness.
```python
from typing import Dict, List

from transformers import AutoTokenizer


class ChatConfig:
    def __init__(self, base_url: str, max_context_window: int):
        self.base_url = base_url
        self.max_context_window = max_context_window
        self.model_repo_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
        # Tokenizer is used purely for client-side token counting.
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_repo_id)

    def count_tokens(self, messages: List[Dict]) -> int:
        """Count tokens for a list of chat messages, including role markers."""
        num_tokens = 0
        for message in messages:
            # Wrap each message with the chat-template markers before encoding.
            text = f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
            num_tokens += len(self.tokenizer.encode(text))
        return num_tokens
```
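For a quick sense of what this yields, here is a hypothetical usage sketch (the endpoint, window size, and messages are made up for illustration, and downloading the gated Llama tokenizer requires your Hugging Face token):

```python
# Illustrative only: values below are placeholders, not the recipe's defaults.
config = ChatConfig(base_url="http://localhost:8000/v1", max_context_window=4096)

sample = [
    {"role": "user", "content": "What is continuous batching?"},
    {"role": "assistant", "content": "It dynamically groups incoming requests..."},
]
print(config.count_tokens(sample))  # prints the token count for this small exchange
```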
How it works:
- Each message is wrapped with special markers (`<|im_start|>` and `<|im_end|>`) to denote the start and end of a message. The tokenizer then encodes this text and counts the number of tokens.
- The `max_context_window` parameter defines the maximum number of tokens allowed in the conversation context. This ensures the application doesn't exceed the model's capacity, maintaining efficiency.
To maintain the conversation's relevance, the latest user and system messages are always included. Older messages are trimmed dynamically when the token count exceeds the window size.
```python
# Reserve tokens for the system prompt and the new user message, which are always included.
history_messages = []
running_total = config.count_tokens([system_prompt, current_message])

if chat_history:
    # Walk the history newest-first so the most recent exchanges are kept.
    for user_msg, bot_msg in reversed(chat_history):
        new_messages = [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": bot_msg},
        ]
        history_tokens = config.count_tokens(new_messages)
        if running_total + history_tokens <= config.max_context_window:
            history_messages = new_messages + history_messages  # prepend to keep chronological order
            running_total += history_tokens
        else:
            break
```
How it works:
- The history is scanned from the most recent exchange backwards; as long as the combined token count stays within `max_context_window`, the messages are included in the active context.
By keeping the active context concise and relevant, the system optimizes resource usage and maintains high performance even during extended interactions. This approach prevents unnecessary memory consumption and ensures the application remains responsive.
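Putting the two pieces together, the trimming logic can be wrapped in a small reusable helper. This is a minimal sketch rather than the recipe's exact code; the function name `build_history_messages` is my own:

```python
from typing import Dict, List, Tuple


def build_history_messages(
    chat_history: List[Tuple[str, str]],
    config: ChatConfig,
    running_total: int = 0,
) -> Tuple[List[Dict], int]:
    """Return the most recent (user, assistant) pairs that fit in the context window."""
    history_messages: List[Dict] = []
    for user_msg, bot_msg in reversed(chat_history or []):
        pair = [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": bot_msg},
        ]
        pair_tokens = config.count_tokens(pair)
        if running_total + pair_tokens > config.max_context_window:
            break  # older turns no longer fit; stop walking back
        history_messages = pair + history_messages  # keep chronological order
        running_total += pair_tokens
    return history_messages, running_total
```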
Chat user interface
The UI logic is included in the ui.py file and is central to the continuous chat interface. Here's how it enables the chat system:
Gradio provides a user-friendly interface, making interactions intuitive and accessible.
```python
import gradio as gr


def create_interface(config: ChatConfig, client, system_prompt, concurrency_limit: int = 1):
    with gr.Blocks(theme="soft") as iface:
        gr.Markdown("# Chat with Llama 3 model\n\nPowered by Modular [MAX](https://docs.modular.com/max/) 🚀")

        chatbot = gr.Chatbot(height=400)
        msg = gr.Textbox(label="Message", placeholder="Type your message here...")
        clear = gr.Button("Clear")

        initial_usage = f"**Total Tokens Generated**: 0 | Context Window: {config.max_context_window}"
        token_display = gr.Markdown(initial_usage)

        async def respond_wrapped(message, chat_history):
            # Stream partial responses from the respond() generator into the UI.
            async for response in respond(message, chat_history, config, client, system_prompt):
                yield response

        msg.submit(
            respond_wrapped,
            [msg, chatbot],
            [chatbot, token_display],
            api_name="chat",
        ).then(lambda: "", None, msg)  # clear the textbox after sending

        clear.click(lambda: ([], initial_usage), None, [chatbot, token_display], api_name="clear")

    iface.queue(default_concurrency_limit=concurrency_limit)
    return iface
```
Key components:
- `gr.Chatbot` renders the conversation, and `gr.Textbox` captures new messages.
- `token_display` shows token usage against the configured context window.
- `respond_wrapped` streams partial model responses back into the chat as they arrive.
- The Clear button resets both the conversation and the token counter.
- `iface.queue(...)` limits concurrent requests to the configured `concurrency_limit`.
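To tie these pieces together, a startup script along the following lines would build the config, an OpenAI-compatible async client pointed at MAX Serve, and the Gradio app. The exact entry point in ui.py may differ, and the endpoint, window size, and prompt below are placeholders; treat this as a sketch:

```python
import os

from openai import AsyncOpenAI

# Assumed environment variable; defaults to a local MAX Serve endpoint for illustration.
base_url = os.environ.get("BASE_URL", "http://localhost:8000/v1")

config = ChatConfig(base_url=base_url, max_context_window=4096)
client = AsyncOpenAI(base_url=base_url, api_key="EMPTY")  # placeholder key for a local server
system_prompt = {"role": "system", "content": "You are a helpful assistant."}

iface = create_interface(config, client, system_prompt, concurrency_limit=1)
iface.launch(server_name="0.0.0.0", server_port=7860)
```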
The interface communicates with the Llama 3 model via the MAX Serve API to fetch chat completions.
```python
async def respond(message, chat_history, config: ChatConfig, client, system_prompt):
    chat_history = chat_history or []

    # Ignore empty or non-string input.
    if not isinstance(message, str) or not message.strip():
        yield chat_history, f"**Active Context**: 0/{config.max_context_window}"
        return

    messages = [system_prompt]
    current_message = {"role": "user", "content": message}

    # history_messages and running_total come from the rolling-window logic shown earlier.
    messages.extend(history_messages)
    messages.append(current_message)
    chat_history.append([message, None])

    response = await client.chat.completions.create(
        model=config.model_repo_id,
        messages=messages,
        stream=True,
        max_tokens=config.max_context_window,
    )

    bot_message = ""
    async for chunk in response:
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            bot_message += delta.content
            chat_history[-1][1] = bot_message
            yield chat_history, f"**Active Context**: {running_total}/{config.max_context_window}"
```
The `wait_for_healthy` function ensures the MAX Serve API is ready before processing requests, retrying until the server is live.
```python
import requests
from tenacity import (
    retry,
    retry_if_exception_type,
    retry_if_result,
    stop_after_attempt,
    wait_fixed,
)


def wait_for_healthy(base_url: str):
    @retry(
        stop=stop_after_attempt(20),  # give up after 20 attempts
        wait=wait_fixed(60),          # wait 60 seconds between attempts
        retry=(
            retry_if_exception_type(requests.RequestException)
            | retry_if_result(lambda response: response.status_code != 200)
        ),
    )
    def _check_health():
        return requests.get(f"{base_url}/health", timeout=5)

    return _check_health()
```
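In a startup script you would typically call it right before launching the UI; a brief sketch, reusing the names from the earlier snippets:

```python
# Block until MAX Serve answers its health endpoint, then expose the Gradio app.
wait_for_healthy(config.base_url)
iface.launch(server_name="0.0.0.0", server_port=7860)
```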
When deploying your chat application, consider these key factors:
- Context window size: keep the client-side `max_context_window` within the model's limit (in MAX Serve, `--max-length`).
- Continuous batching: `MAX_BATCH_SIZE` controls concurrent request handling via continuous batching (in MAX Serve, `--max-batch-size`).
- Memory management: monitor GPU memory usage with `nvidia-smi`.
You can explore the available configuration options by running the commands below; a short sanity-check sketch relating these settings follows.
magic global install max-pipelines
max-pipelines serve --help
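To connect those serving flags back to the client-side settings, here is a small sanity-check sketch; the environment variable names and defaults are assumptions for illustration, not part of the recipe:

```python
import os

# Assumed environment variables mirroring the serving flags discussed above.
max_length = int(os.environ.get("MAX_LENGTH", "8192"))                 # --max-length
max_batch_size = int(os.environ.get("MAX_BATCH_SIZE", "16"))           # --max-batch-size
max_context_window = int(os.environ.get("MAX_CONTEXT_WINDOW", "4096"))  # client-side rolling window

if max_context_window > max_length:
    raise ValueError(
        f"max_context_window ({max_context_window}) exceeds the server's "
        f"--max-length ({max_length}); shrink the rolling window or raise --max-length."
    )
print(f"Serving with --max-length={max_length}, --max-batch-size={max_batch_size}")
```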
On the serving side, make sure to check out the benchmarking tutorial and the benchmarking blog too.
In this recipe, we've built a functional chat application using Llama 3 and MAX Serve. We've explored how to manage a rolling context window, stream responses into a Gradio chat UI, and tune MAX Serve options for deployment.
This recipe demonstrates how the MAX serving stack can be combined with Llama 3 to create interactive chat applications. While this implementation focuses on the basics, it provides a foundation that you can build on for your own projects.
Deploy Llama 3 on GPU with MAX Serve to AWS, GCP or Azure or on Kubernetes.
Explore MAX's documentation for additional features.
Join our Modular Forum and Discord community to share your experiences and get support.
We're excited to see what you'll build with Llama 3 and MAX! Share your projects and experiences with us using #ModularAI on social media.
DETAILS
THE CODE: max-serve-continuous-chat
AUTHOR: Ehsan M. Kermani
AVAILABLE TASKS: magic run app, magic run clean
PROBLEMS WITH THE CODE? File an Issue