Llama 2

Language: Mojo 🔥

API: MAX Graph

This pipeline demonstrates text completion from an initial prompt using the Llama 2 large language model. The model itself has been constructed from end to end in the Mojo language using the MAX Graph API.

The MAX Graph API provides an accessible Mojo interface to the construction of flexible accelerated compute graphs, which are then optimized by the MAX Engine's advanced graph compiler. This pipeline showcases how a large language model can be fully defined using Mojo and MAX Graphs and then compiled for optimal inference performance via the MAX Engine.
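
To make this more concrete, the following is a minimal sketch of building and executing a small graph, closely modeled on the MAX Graph getting-started example. The module paths, the default tensor names ("input0" and "output0"), and the exact signatures are assumptions that can differ between MAX releases, so treat it as an outline rather than a drop-in program:

    from max import engine
    from max.graph import Graph, TensorType, ops
    from max.tensor import Tensor, TensorShape

    def main():
        # Define a graph with a single float32 input of shape [2, 6].
        graph = Graph(TensorType(DType.float32, 2, 6))

        # Bake a [6, 1] weight matrix into the graph as a constant node, then
        # chain matmul and relu nodes and mark the result as the graph output.
        weights = graph.constant(Tensor[DType.float32](TensorShape(6, 1), 0.15))
        graph.output(ops.relu(graph[0] @ weights))
        graph.verify()

        # Compile the graph with the MAX Engine and run it on a sample input.
        session = engine.InferenceSession()
        model = session.load(graph)
        input0 = Tensor[DType.float32](TensorShape(2, 6), 0.5)
        results = model.execute("input0", input0)
        print(results.get[DType.float32]("output0"))

The Llama 2 pipeline follows the same pattern at a much larger scale: each transformer layer is expressed as graph operations, and the full graph is compiled once before token generation begins.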

Model

Llama 2 is an open source large language model released by Meta. The structure of this implementation was inspired by Andrej Karpathy's llama2.c, and originally written in Mojo by Aydyn Tairov.

The text completion demo is compatible with the official Llama 2 text completion demo.

The default settings for this pipeline use the 7B set of pretrained weights in the q4_k quantized encoding.
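
The q4-style encodings trade a small amount of accuracy for a large reduction in memory by storing weights as blocks of low-bit integers plus per-block scaling data. The snippet below is a deliberately simplified, hypothetical sketch of symmetric 4-bit block quantization; it is not the actual q4_k block layout used by the downloaded weights, but it illustrates why roughly four bits per weight plus a per-block scale can still approximate float32 values reasonably well:

    from random import random_float64, seed

    def main():
        # Hypothetical simplified scheme: a block of 32 float32 weights is replaced
        # by 32 signed 4-bit integers plus one float32 scale factor.
        seed()
        var block = List[Float32]()
        for i in range(32):
            block.append(random_float64(-1.0, 1.0).cast[DType.float32]())

        # Pick the scale so the largest magnitude maps onto the 4-bit range [-7, 7].
        var max_abs: Float32 = 0.0
        for i in range(len(block)):
            if abs(block[i]) > max_abs:
                max_abs = abs(block[i])
        var scale = max_abs / 7.0

        # Quantize each weight to an integer step, dequantize it again, and
        # measure how much accuracy the 4-bit representation gives up.
        var total_error: Float32 = 0.0
        for i in range(len(block)):
            var q = round(block[i] / scale)   # value that would be stored in 4 bits
            var dq = q * scale                # value the model sees at inference time
            total_error += abs(block[i] - dq)
        print("mean absolute quantization error:", total_error / 32.0)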

Usage

The easiest way to try out this pipeline is with our Magic command-line tool. Follow the instructions to install Magic. Once installed, you can try out text generation using Llama 2 with the following command:

magic run llama2 --prompt "I believe the meaning of life is"

On first execution, the tokenizer library and model weights will be downloaded and placed in a .cache/modular subdirectory within your home directory. The model will then be compiled and text completion will begin from the specified prompt.

To modify or build upon the pipeline code, you can use the following steps:

  1. Install MAX:

    If MAX is not already installed, follow the installation instructions to set it up on your system.

  2. Clone the MAX examples repository:

    If you don't already have a local clone of this repository, create one via:

    git clone https://github.com/modularml/max.git
    

    The following instructions assume that you're working from within this pipeline's directory; after cloning, change into it with:

    cd max/examples/graph-api/pipelines/llama2/
    
  3. (Optional) Install Python dependencies:

    This enables use of the Hugging Face transformers AutoTokenizer. If transformers isn't available, a Mojo tokenizer implementation is used instead.

    python3 -m pip install -r requirements.txt
    
  4. Run the text completion demo:

    To access the Llama models, you need to accept their license on Hugging Face.

    The license is located on the Llama-2-7b-hf model page.

    All of the pipelines have been configured to use a common driver, located in the directory hosting all MAX Graph examples. Assuming you're starting at the path of this README, the command invocation will look like:

    mojo ../../run_pipeline.🔥 llama2 --prompt "I believe the meaning of life is"
    

Options

The following command-line options are available to customize operation of the pipeline:

  • --model-path: Overrides the default URL, and allows for an already-downloaded pretrained weight file to be used with the model.
  • --custom-ops-path: The path to a compiled Mojo package containing a custom graph operation to use within the pipeline.
  • --tokenizer-path: The path to the tokenizer library to be used by the pipeline. (Default value: .cache/tokenizer.bin)
  • --max-length: The context length of the model. (Default value: 512)
  • --max-new-tokens: The maximum number of new tokens to generate. If a value of -1 is provided, the model will continue to generate tokens for the entire context length. (Default value: -1)
  • --min-p: The minimum probability, relative to the probability of the most likely token, required to keep a token during Min P sampling. (Default value: 0.05)
  • --prompt: The text prompt to use for further generation.
  • --quantization-encoding: The encoding to use for a datatype that can be quantized to a low bits per weight format. The options for quantized formats will download and cache default weights, but float32 requires the use of --model-path to specify locally downloaded full-precision weights for use in the model. Valid values: q4_0, q4_k, q6_k, float32. (Default value: q4_k)
  • --temperature: The temperature for sampling, on a scale from 0.0 to 1.0, with 0.0 being greedy sampling; temperature and Min P sampling are both illustrated in the sketch after this list. (Default value: 0.5)
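
To make the sampling options concrete, here is a small, self-contained sketch of how temperature scaling and Min P filtering combine when picking the next token. The logit values are made up, and the details are illustrative assumptions rather than the pipeline's actual sampler:

    from math import exp
    from random import random_float64, seed

    def main():
        # Toy logits for a 5-token vocabulary (made-up values, for illustration only).
        var logits = List[Float32]()
        logits.append(2.0)
        logits.append(1.5)
        logits.append(0.3)
        logits.append(-0.5)
        logits.append(-2.0)

        var temperature: Float32 = 0.5
        var min_p: Float32 = 0.05

        # Temperature scaling followed by softmax: lower temperatures sharpen the
        # distribution, and temperature -> 0 approaches greedy (argmax) sampling.
        var probs = List[Float32]()
        var denom: Float32 = 0.0
        for i in range(len(logits)):
            var p = exp(logits[i] / temperature)
            probs.append(p)
            denom += p
        var max_prob: Float32 = 0.0
        for i in range(len(probs)):
            probs[i] = probs[i] / denom
            if probs[i] > max_prob:
                max_prob = probs[i]

        # Min P filtering: discard any token whose probability falls below
        # min_p times the probability of the most likely token.
        var kept_mass: Float32 = 0.0
        for i in range(len(probs)):
            if probs[i] < min_p * max_prob:
                probs[i] = 0.0
            kept_mass += probs[i]

        # Sample within the surviving probability mass, which is equivalent to
        # renormalizing the filtered distribution and sampling from it.
        seed()
        var r = random_float64().cast[DType.float32]() * kept_mass
        var chosen = 0
        for i in range(len(probs)):
            r -= probs[i]
            if r <= 0.0:
                chosen = i
                break
        print("sampled token index:", chosen)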

Ideas for future extension

There are many ways that this pipeline can be built upon or extended, and this is a short list of suggestions for future work:

  • Enhance the tokenizer so that it can stand alone as a general-purpose tokenizer for multiple text generation pipelines.
  • Expand the customizable options for text generation.
  • Incorporate and use weights from other models.
  • Improve the quality of the text generation.
  • Identify performance bottlenecks and further tune time-to-first-token and throughput.

Copyright © 2024 Modular Inc.