llama3-chatqa-8b

MAX Model

1 version

A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).

Run this model

  1. Install our Magic package manager:

    curl -ssL https://magic.modular.com/ | bash

    Then run the source command that's printed in your terminal.

  2. Install MAX Pipelines to run this model:

    magic global install max-pipelines
  3. Start a local endpoint for llama3-chatqa/8b:

    max-pipelines serve --huggingface-repo-id nvidia/Llama3-ChatQA-1.5-8B

    The endpoint is ready when you see the URI printed in your terminal:

    Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
  4. Now open another terminal to send a request using curl (a Python version of the same request is sketched after these steps):

    curl -N http://0.0.0.0:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "llama3-chatqa/8b",
        "stream": true,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the World Series in 2020?"}
        ]
    }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
  5. 🎉 Hooray! You’re running Generative AI. Our goal is to make this as easy as possible.
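
The endpoint speaks the OpenAI chat-completions API, so you can also call it from Python instead of curl. Below is a minimal sketch of the same streaming request, assuming the `openai` Python package (v1 or later) is installed; the `api_key` value is a placeholder, since the local server doesn't authenticate requests:

    from openai import OpenAI

    # Point the client at the local MAX endpoint. The API key is unused by
    # the local server, but the client requires a non-empty value.
    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

    # Stream a chat completion, mirroring the curl request in step 4.
    stream = client.chat.completions.create(
        model="llama3-chatqa/8b",
        stream=True,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the World Series in 2020?"},
        ],
    )

    # Each chunk carries an incremental piece of the reply; print as it arrives.
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()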

About

ChatQA-1.5 is an advanced conversational AI model built on the Llama-3 base architecture. It is specifically designed to perform well in conversational question-answering tasks, with enhanced capabilities in tabular data interpretation and arithmetic calculations. This makes it particularly effective for applications requiring precise contextual understanding and numerical reasoning.

ChatQA-1.5 comes in two model sizes:

  • Llama3-ChatQA-1.5-8B: A smaller, more efficient variant suitable for lightweight deployments.
  • Llama3-ChatQA-1.5-70B: A more powerful variant optimized for complex, high-performance conversational tasks.

These models are fine-tuned using a diverse corpus of conversational QA data to ensure robust performance across various real-world scenarios.
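
For retrieval-augmented generation, you supply the retrieved passages (including tables) directly in the conversation and ask the model to answer from them. The sketch below runs against the local endpoint started above; the table and question are invented purely for illustration, and the `openai` package is again assumed to be installed:

    from openai import OpenAI

    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

    # Hypothetical retrieved context: a small table the model can reason over,
    # exercising ChatQA's tabular-interpretation and arithmetic strengths.
    context = (
        "Quarterly revenue (USD millions):\n"
        "Q1: 12.4\nQ2: 15.1\nQ3: 14.0\nQ4: 18.5"
    )

    response = client.chat.completions.create(
        model="llama3-chatqa/8b",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {
                "role": "user",
                "content": f"Context:\n{context}\n\n"
                           "Question: What was total revenue for the year?",
            },
        ],
    )
    print(response.choices[0].message.content)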

References

Website

Hugging Face

DETAILS

MODEL CLASS
MAX Model

MAX Models are highly optimized inference pipelines that deliver state-of-the-art performance on both CPU and GPU. Many of them are the fastest available implementations of their respective models.

Browse 18+ MAX Models

MODULAR GITHUB

Modular

CREATED BY

nvidia

MODEL

nvidia/Llama3-ChatQA-1.5-8B

TAGS

arxiv:2401.10225
autotrain_compatible
chatqa
chatqa-1.5
conversational
en
endpoints_compatible
license:llama3
llama
llama-3
nvidia
pytorch
region:us
safetensors
text-generation
text-generation-inference
transformers

© Copyright - Modular Inc - 2024