Phi-4-multimodal-instruct
Ask me anything
License: MIT (https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/LICENSE)
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs and generates text outputs, and it supports a 128K-token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modality supports are the following:

- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
đź“° Phi-4-multimodal Microsoft Blog
đź“– Phi-4-multimodal Technical Report
🏡 Phi Portal
👩‍🍳 Phi Cookbook
🖥️ Try It on Azure, GitHub, Nvidia, Huggingface playgrounds
📱 Huggingface Spaces: Thoughts Organizer, Stories Come Alive, Phine Speech Translator
🎉Phi-4: [multimodal-instruct | onnx]; [mini-instruct | onnx]
Watch as Phi-4 Multimodal analyzes spoken language to help plan a trip to Seattle, demonstrating its advanced audio processing and recommendation capabilities.
See how Phi-4 Multimodal tackles complex mathematical problems through visual inputs, demonstrating its ability to process and solve equations presented in images.
Explore how Phi-4 Mini functions as an intelligent agent, showcasing its reasoning and task execution abilities in complex scenarios.
The model is intended for broad multilingual and multimodal commercial and research use. It provides uses for general-purpose AI systems and applications that require multimodal (text, image, and audio) input understanding.
The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.
The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language and multimodal models, as well as performance differences across languages, as they select use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy and trade compliance laws) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
This release of Phi-4-multimodal-instruct is based on valuable user feedback from the Phi-3 series. Previously, users could use a speech recognition model to talk to the Mini and Vision models. To achieve this, users needed to chain two models in a pipeline: one model to transcribe the audio to text, and another model for the language or vision tasks. With such a pipeline, the core model is not provided the full breadth of input information; for example, it cannot directly observe multiple speakers or background noises, or jointly align speech, vision, and language information in the same representation space. With Phi-4-multimodal-instruct, a single new open model has been trained across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. The model employs a new architecture and a larger vocabulary for efficiency, adds multilingual and multimodal support, and uses improved post-training techniques for instruction following and function calling, along with additional data, leading to substantial gains on key multimodal capabilities. It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!
To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of the language benchmarks. A high-level overview of the model quality on representative speech and vision benchmarks follows:
Phi-4-multimodal-instruct was observed to perform as follows:
The performance of Phi-4-multimodal-instruct on the aggregated benchmark datasets:
The performance of Phi-4-multimodal-instruct on different languages, averaging the WERs of CommonVoice and FLEURS:
Translating from German, Spanish, French, Italian, Japanese, Portuguese, and Chinese to English:
Translating from English to German, Spanish, French, Italian, Japanese, Portuguese, and Chinese. Note that WhisperV3 does not support this capability:
MT bench scores are scaled by 10x to match the score range of MMMLU:
AIR bench scores are scaled by 10x to match the score range of MMAU:
Phi-4-multimodal-instruct is capable of processing image and audio together. The following table shows the model quality when the input query for vision content is synthetic speech, on chart/table understanding and document reasoning tasks. Compared to other existing state-of-the-art omni models that accept audio and visual signals as input, Phi-4-multimodal-instruct achieves much stronger performance on multiple benchmarks.
Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
---|---|---|---|---|---|
s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |
To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. A high-level overview of the model quality on representative benchmarks follows:
Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen 2.5-VL-3B-ins | Intern VL 2.5-4B | Qwen 2.5-VL-7B-ins | Intern VL 2.5-8B | Gemini 2.0-Flash Lite-preview-0205 | Gemini2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
---|---|---|---|---|---|---|---|---|---|---|
Popular aggregated benchmark | ||||||||||
MMMU | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
MMMU-Pro (std/vision) | 38.5 | 21.8 | 29.9 | 32.4 | 36.9 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
Visual science reasoning | ||||||||||
ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
Visual math reasoning | ||||||||||
MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
Chart & table reasoning | ||||||||||
AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
Document Intelligence | ||||||||||
TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
Object visual presence verification | ||||||||||
POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
Multi-image perception | ||||||||||
BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
Video MME 16 frames | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.1 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |
Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much larger models on multi-frame capabilities. BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very quickly but that remain hard for current multimodal LLMs.
Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
---|---|---|---|---|---|---|---|---|---|
Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
Object Localization | 52.5 | 55.7 | 53.3 | 55.7 | 58.2 | 63.9 | 67.2 | 58.2 | 65.6 |
Relative Depth | 69.4 | 68.5 | 68.5 | 80.6 | 76.6 | 81.5 | 72.6 | 66.1 | 73.4 |
Relative Reflectance | 26.9 | 38.8 | 38.8 | 32.8 | 38.8 | 33.6 | 34.3 | 38.1 | 38.1 |
Semantic Correspondence | 52.5 | 32.4 | 33.8 | 28.8 | 24.5 | 56.1 | 55.4 | 43.9 | 47.5 |
Spatial Relation | 72.7 | 80.4 | 86.0 | 88.8 | 86.7 | 74.1 | 79.0 | 74.8 | 83.2 |
Visual Correspondence | 67.4 | 28.5 | 39.5 | 50.0 | 44.2 | 84.9 | 91.3 | 72.7 | 82.6 |
Visual Similarity | 86.7 | 67.4 | 88.1 | 87.4 | 85.2 | 87.4 | 80.7 | 79.3 | 83.0 |
Overall | 61.6 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
The Phi-4 family has been integrated in the `4.48.2` version of `transformers`. The current `transformers` version can be verified with: `pip list | grep transformers`.
Examples of required packages:
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
Phi-4-multimodal-instruct is also available in Azure AI Studio.
Phi-4-multimodal-instruct supports a vocabulary size of up to `200064` tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, and the tokenizer can also be extended up to the model's vocabulary size.
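As a minimal sketch (assuming the standard `transformers` tokenizer API; the token name below is hypothetical), the tokenizer can be inspected and extended as follows:

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with the model and inspect its size.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)
print(len(tokenizer))        # current tokenizer length
print(tokenizer.vocab_size)  # base vocabulary size

# Register a task-specific token; "<|my_task_tag|>" is a hypothetical name.
# Embeddings only need resizing if the new length exceeds the model's
# 200064-token embedding table.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|my_task_tag|>"]}
)
print(f"added {num_added} token(s), new length {len(tokenizer)}")
```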
Given the nature of the training data, the Phi-4-multimodal-instruct model is best suited for prompts using the chat format as follows:
This format is used for general conversation and instructions:
<|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>
This format is used when the user wants the model to provide function calls based on the given tools. The user should provide the available tools in the system prompt, wrapped by <|tool|> and <|/tool|> tokens. The tools should be specified in JSON format, using a JSON dump structure (see the sketch after the example below). Example:
<|system|>You are a helpful assistant with some tools.<|tool|>[{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]<|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>
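For illustration, the tool-enabled prompt above can be assembled from a Python tool definition with `json.dumps`; this is only a sketch of the prompt construction, using the same example tool as above:

```python
import json

# Tool definition taken from the example above; json.dumps produces the
# JSON dump structure placed between the <|tool|> and <|/tool|> tokens.
tools = [{
    "name": "get_weather_updates",
    "description": "Fetches weather updates for a given city using the RapidAPI Weather API.",
    "parameters": {
        "city": {
            "description": "The name of the city for which to retrieve weather information.",
            "type": "str",
            "default": "London",
        }
    },
}]

prompt = (
    "<|system|>You are a helpful assistant with some tools."
    f"<|tool|>{json.dumps(tools)}<|/tool|><|end|>"
    "<|user|>What is the weather like in Paris today?<|end|>"
    "<|assistant|>"
)
print(prompt)
```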
This format is used for conversation with image:
<|user|><|image_1|>Describe the image in detail.<|end|><|assistant|>
For multiple images, the user needs to insert multiple image placeholders in the prompt as below:
<|user|><|image_1|><|image_2|><|image_3|>Summarize the content of the images.<|end|><|assistant|>
This format is used for various speech and audio tasks:
<|user|><|audio_1|>{task prompt}<|end|><|assistant|>
The task prompt can vary for different tasks. Automatic Speech Recognition:
<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>
Automatic Speech Translation:
<|user|><|audio_1|>Translate the audio to {lang}.<|end|><|assistant|>
Automatic Speech Translation with chain-of-thoughts:
<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to {lang}. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>
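Because the prompt requests a <sep> separator, the transcript and translation can be split apart in post-processing; a minimal sketch (the response string below is an illustrative placeholder, not model output):

```python
# Split a chain-of-thought translation response on the <sep> separator
# requested in the prompt; the string below is an illustrative placeholder.
response = "It is raining today. <sep> Il pleut aujourd'hui."
transcript, _, translation = response.partition("<sep>")
print(transcript.strip())   # original transcript
print(translation.strip())  # translated text
```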
Spoken-query Question Answering:
<|user|><|audio_1|><|end|><|assistant|>
This format is used for conversation with image and audio. The audio may contain a query related to the image:
<|user|><|image_1|><|audio_1|><|end|><|assistant|>
For multiple images, the user needs to insert multiple image placeholders in the prompt as below:
<|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>
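As a sketch of preparing such an input (file names are placeholders, and passing a list of PIL images via `images=` is an assumption based on the processor interface shown in the sample inference code below):

```python
from transformers import AutoProcessor
from PIL import Image
import soundfile as sf

# Multi-image + audio query; file paths are placeholders.
processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)
image_1 = Image.open("photo_1.jpg")
image_2 = Image.open("photo_2.jpg")
audio, samplerate = sf.read("spoken_question.wav")

prompt = "<|user|><|image_1|><|image_2|><|audio_1|><|end|><|assistant|>"
inputs = processor(
    text=prompt,
    images=[image_1, image_2],
    audios=[(audio, samplerate)],
    return_tensors="pt",
)
```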
After obtaining the Phi-4-multimodal-instruct model checkpoints, users can use this sample code for inference.
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="cuda",
torch_dtype="auto",
trust_remote_code=True,
# if you do not use Ampere or later GPUs, change attention to "eager"
_attn_implementation='flash_attention_2',
).cuda()
# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)
# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
# Generate response
generate_ids = model.generate(
**inputs,
max_new_tokens=1000,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))
# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
**inputs,
max_new_tokens=1000,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. The Phi-4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and to leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:
Phi-4-multimodal-instruct's training data includes a wide variety of sources, totaling 5 trillion text tokens, and is a combination of filtered publicly available documents and images, newly created synthetic data, and high-quality human-labeled data spanning text, image, and audio.
Focus was placed on the quality of data that could potentially improve the reasoning ability of the model, and the publicly available documents were filtered to contain a preferred level of knowledge. As an example, the result of a game in the Premier League on a particular day might be good training data for large foundation models, but such information was removed for Phi-4-multimodal-instruct to leave more model capacity for reasoning, given the model's small size. The data collection process involved sourcing information from publicly available documents, with a focus on filtering out undesirable documents and images. To safeguard privacy, image and text data sources were filtered to remove or scrub potentially personal data from the training data. The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis.
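A minimal sketch of this style of n-gram overlap check follows; the n-gram size, threshold, and toy strings are illustrative placeholders, not the values used for the model:

```python
# Illustrative n-gram decontamination check: flag training samples whose
# n-gram overlap with a benchmark set exceeds a threshold.
N = 13
THRESHOLD = 0.4

def ngrams(text, n=N):
    tokens = text.lower().split()  # stand-in for the real normalizer/tokenizer
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample, benchmark_ngrams):
    sample_ngrams = ngrams(sample)
    if not sample_ngrams:
        return False
    overlap = len(sample_ngrams & benchmark_ngrams) / len(sample_ngrams)
    return overlap >= THRESHOLD

benchmark_ngrams = set()
for doc in ["toy benchmark question one ...", "toy benchmark question two ..."]:
    benchmark_ngrams |= ngrams(doc)

training_samples = ["toy training document ..."]
clean = [s for s in training_samples if not is_contaminated(s, benchmark_ngrams)]
```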
Basic examples of supervised fine-tuning (SFT) are provided for speech and vision, respectively.
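As a hypothetical, parameter-efficient starting point (not the provided SFT scripts), a LoRA adapter could be attached with `peft`; the rank, alpha, and "all-linear" target selection below are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Attach a LoRA adapter for parameter-efficient fine-tuning; hyperparameters
# and the "all-linear" target selection are illustrative, not recommended values.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```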
The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running Text Safety datasets through Azure TTS (Text-To-Speech) Service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified both in public and internal multi-modal RAI datasets.
Various evaluation techniques including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets were leveraged to evaluate Phi-4 models' propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of one approach alone. Findings across the various evaluation methods indicate that safety post-training that was done as detailed in the Phi 3 Safety Post-Training paper had a positive impact across multiple languages and risk categories as observed by refusal rates (refusal to output undesirable outputs) and robustness to jailbreak techniques. Details on prior red team evaluations across Phi models can be found in the Phi 3 Safety Post-Training paper. For this release, the red teaming effort focused on the newest Audio input modality and on the following safety areas: harmful content, self-injury risks, and exploits. The model was found to be more susceptible to providing undesirable outputs when attacked with context manipulation or persuasive techniques. These findings applied to all languages, with the persuasive techniques mostly affecting French and Italian. This highlights the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low resource languages, and risk areas that account for cultural nuances where those languages are spoken.
To assess model safety in scenarios involving both text and images, Microsoft's Azure AI Evaluation SDK was utilized. This tool facilitates the simulation of single-turn conversations with the target model by providing prompt text and images designed to incite harmful responses. The target model's responses are subsequently evaluated by a capable model across multiple harm categories, including violence, sexual content, self-harm, hateful and unfair content, with each response scored based on the severity of the harm identified. The evaluation results were compared with those of Phi-3.5-Vision and open-source models of comparable size. In addition, we ran both an internal and the public RTVLM and VLGuard multi-modal (text & vision) RAI benchmarks, once again comparing scores with Phi-3.5-Vision and open-source models of comparable size. However, the model may be susceptible to language-specific attack prompts and cultural context.
In addition to extensive red teaming, the Safety of the model was assessed through three distinct evaluations. First, as performed with Text and Vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to Speech prompts. Second, Microsoft's Speech Fairness evaluation was run to verify that Speech-To-Text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, or medical condition) from the voice of a user.
Note that by default, the Phi-4-multimodal-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
If you want to run the model on GPUs that do not support flash attention (for example, pre-Ampere NVIDIA GPUs), load the model with the eager attention implementation, as in the sketch below:
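This mirrors the loading call in the sample inference code above, changing only the attention implementation:

```python
from transformers import AutoModelForCausalLM

# Load without flash attention (e.g., on pre-Ampere GPUs); otherwise identical
# to the loading call in the sample inference code above.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",
)
```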
The model is licensed under the MIT license.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
We include a brief word on methodology here - and in particular, how we think about optimizing prompts. In an ideal world, we would never change any prompts in our benchmarks to ensure it is always an apples-to-apples comparison when comparing different models. Indeed, this is our default approach, and is the case in the vast majority of models we have run to date. There are, however, some exceptions to this. In some cases, we see a model that performs worse than expected on a given eval due to a failure to respect the output format. For example:
However, we do not:
The goal of the benchmark setup is to measure the performance of the LMM when a regular user utilizes these models for a task involving visual input. To this end, we selected 9 popular and publicly available single-frame datasets and 3 multi-frame benchmarks that cover a wide range of challenging topics and tasks (e.g., mathematics, OCR tasks, charts-and-plots understanding, etc.) as well as a set of high-quality models. Our benchmarking setup utilizes zero-shot prompts, and all of the prompt content is the same for every model. We only formatted the prompt content to satisfy the model's prompt API. This ensures that our evaluation is fair across the set of models we tested. Many benchmarks require models to choose their responses from a presented list of options. Therefore, we've included a directive in the prompt's conclusion, guiding all models to pick the option letter that corresponds to the answer they deem correct. In terms of the visual input, we use the images from the benchmarks as they come from the original datasets. We converted these images to base-64 using a JPEG encoding for models that require this format (e.g., GPTV, Claude Sonnet 3.5, Gemini 1.5 Pro/Flash). For other models (e.g., Llava Interleave, and InternVL2 4B and 8B), we used their Huggingface interface and passed in PIL images or a JPEG image stored locally. We did not scale or pre-process images in any other way. Lastly, we used the same code to extract and evaluate answers for every considered model. This ensures that we are fair in assessing the quality of their answers.
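For reference, a minimal sketch of the base-64 JPEG encoding step described above (the file name is a placeholder; the actual benchmark harness is not shown here):

```python
import base64
import io
from PIL import Image

# Encode a benchmark image as base-64 JPEG for model APIs that require it.
image = Image.open("benchmark_image.png").convert("RGB")
buffer = io.BytesIO()
image.save(buffer, format="JPEG")
image_b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
```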
The objective of this benchmarking setup is to assess the performance of models in speech and audio understanding tasks as utilized by regular users. To accomplish this, we selected several state-of-the-art open-source and closed-source models and performed evaluations across a variety of public and in-house benchmarks. These benchmarks encompass diverse and challenging topics, including Automatic Speech Recognition (ASR), Automatic Speech Translation (AST), Spoken Query Question Answering (SQQA), Audio Understanding (AU), and Speech Summarization. The results are derived from evaluations conducted on identical test data without any further clarifications. All results were obtained without sampling during inference. For an accurate comparison, we employed consistent prompts for models across different tasks, except for certain model APIs (e.g., GPT-4o), which may refuse to respond to specific prompts for some tasks. Finally, we used uniform code to extract and evaluate answers for all considered models, ensuring fairness in assessing the quality of their responses.
The model was evaluated across a breadth of public and internal benchmarks to understand its capabilities under multiple tasks and conditions. While most evaluations use English, multilingual benchmarks were incorporated to cover performance in select languages. More specifically,
Vision:
Speech:
Safety and RAI: