Phi 3.5 Vision Instruct
Model Summary
Phi-3.5-vision is a cutting-edge multimodal model designed for tasks that require advanced image and text comprehension. Built on high-quality datasets, including synthetic data and filtered public web data, it features enhanced reasoning capabilities. The model supports a 128K-token context length and underwent fine-tuning and preference optimization for precise instruction adherence and robust safety measures.
Intended Uses
Primary Use Cases
Phi-3.5-vision is intended for broad commercial and research use in English, in AI systems that take both image and text inputs. Main applications include:
- Computation-constrained environments.
- Scenarios with latency limitations.
- General image understanding.
- Optical character recognition.
- Chart and table analysis.
- Multiple image comparison.
- Summarization of multiple images or video clips.
This model aids in accelerating research on language and multimodal models, serving as a foundation for generative AI features.
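To illustrate the multi-image use cases listed above, here is a minimal usage sketch built on the Transformers library. The Hub ID (microsoft/Phi-3.5-vision-instruct), the <|image_i|> placeholder convention, and the num_crops setting are assumptions based on how the Phi-3 vision checkpoints are typically loaded, not specifications from this card; the image URLs are placeholders.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed Hub ID; trust_remote_code is needed because the vision architecture ships as custom code.
model_id = "microsoft/Phi-3.5-vision-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
# num_crops=4 is a commonly suggested setting for multi-frame input (assumption).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

# Two example images to compare/summarize (placeholder URLs).
urls = [
    "https://example.com/slide_1.png",
    "https://example.com/slide_2.png",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

# Assumed Phi-3 vision prompt convention: one <|image_i|> placeholder per input image.
placeholders = "".join(f"<|image_{i + 1}|>\n" for i in range(len(images)))
messages = [
    {"role": "user", "content": placeholders + "Summarize the key differences between these images."}
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, images, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=500, do_sample=False)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The same pattern covers single-image understanding, OCR, and chart/table analysis: pass one image and adjust the instruction text accordingly.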
Use Case Considerations
This model is not designed or evaluated for every downstream purpose. Developers should consider common language-model limitations and assess accuracy, safety, and fairness before using the model in a specific application, particularly in high-risk scenarios. Developers must also comply with applicable laws and regulations (including privacy and trade-compliance laws).
Release Notes
This release adds multi-frame image understanding and reasoning, developed in response to customer feedback. It enables detailed image comparison, multi-image summarization, and video-clip summarization, with broad applicability in scenarios such as Office workflows. Single-image benchmark performance has also improved, with higher MMMU, MMBench, and document-understanding scores. Most use cases should benefit from this release, and community feedback is encouraged.
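As a sketch of the video-summarization scenario mentioned above, frames can be sampled from a clip and passed to the model as a multi-image prompt. The frame count and sampling strategy here are arbitrary illustration choices, and OpenCV is assumed only for decoding; neither is prescribed by this card.

```python
import cv2
from PIL import Image


def sample_frames(video_path: str, num_frames: int = 8) -> list[Image.Image]:
    """Uniformly sample frames from a video clip and return them as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(num_frames):
        # Seek to an evenly spaced frame index and decode it.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx * total / num_frames))
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to RGB before wrapping in a PIL image.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# The sampled frames can then be fed to the model exactly like the multi-image
# example above, with one <|image_i|> placeholder per frame.
```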
Responsible AI Considerations
The Phi model family, like others, can exhibit biases, inaccuracies, or offensive content. Developers must ensure responsible AI practices and legal compliance. Key considerations include:
- Quality of Service: Performance may vary with non-English languages or English varieties underrepresented in training data.
- Representation and Stereotypes: Certain groups may be over- or under-represented, which can reinforce demeaning or negative stereotypes.
- Inappropriate or Offensive Content: Outputs may be inappropriate or offensive in certain contexts, requiring additional mitigations.
- Information Reliability: The model can generate nonsensical, inaccurate, or outdated content; outputs should be checked for accuracy.
- Limited Code Scope: Generated Python code that relies on non-standard packages should be manually verified before use.
Training
Phi-3.5-vision has 4.2B parameters and combines an image encoder with the Phi-3 Mini language model. It accepts text and image inputs and supports a context length of up to 128K tokens. Training used 256 A100-80G GPUs for six days over 500B tokens (vision and text). The model was trained between July and August 2024 and released in August 2024.
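As a quick sanity check of the stated parameter count, the checkpoint can be loaded and its parameters summed. This is a minimal sketch assuming the weights are published on the Hugging Face Hub under microsoft/Phi-3.5-vision-instruct; that ID is an assumption, not something stated in this card.

```python
from transformers import AutoModelForCausalLM

# Assumed Hub ID; trust_remote_code loads the custom vision architecture shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e9:.2f}B")  # expected to be roughly 4.2B
```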
Benchmarks
The model performs strongly across image-related benchmarks and outperforms several competing models on multi-image and video understanding tasks.
Safety Evaluation and Red-Teaming
Safety alignment combined supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), and was evaluated through red teaming and adversarial conversation simulations. The techniques are described in detail in the technical report.
Software & Hardware
The model relies on the PyTorch, Transformers, and Flash-Attention libraries and has been tested on NVIDIA A100, A6000, and H100 GPUs.
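On GPUs without Flash-Attention support, a fallback attention implementation can be selected at load time. The sketch below uses the standard Transformers attn_implementation argument and the same assumed Hub ID as above; the compute-capability check is an illustrative heuristic, not guidance from this card.

```python
import torch
from transformers import AutoModelForCausalLM

# Flash-Attention 2 generally requires an Ampere-or-newer GPU (compute capability >= 8.0),
# e.g. A100/H100; on older hardware such as some A6000 setups, fall back to "eager".
use_flash = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",  # assumed Hub ID
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2" if use_flash else "eager",
)
```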
License
The model is released under the MIT License.
Citations