InternOmni: Extending InternVL with Audio Modality
Type | Model | Date | Download | Note |
---|---|---|---|---|
Multimodal Large Language Models | InternOmni | 2024.07.25 | 🤗 HF link | Extending InternVL's Modalities to Audio with Good Performance |
Vision Foundation Model | InternViT-300M-448px | 2024.05.25 | 🤗 HF link | Distilled small vision foundation model with 300M parameters. |
Audio Foundation Model | Whisper-large-v3 | 2024.07.25 | 🤗 HF link | Pre-trained model for ASR and speech translation |
InternOmni
Method
We introduce InternOmni, an open-source multimodal large language model that adds audio input to the existing InternVL series. The goal is to extend the modalities supported by the InternVL series, moving further toward general artificial intelligence while giving users a better experience. We employ the following designs:
- Strong Vision Encoder: Building on the previously used vision model InternViT-6B, we used distillation to create a lightweight vision foundation model, InternViT-300M. This enhances visual understanding while reducing the model size.
- Efficient Audio Encoder: We adopted OpenAI's open-source Whisper-large-v3 model, which has been trained on a large amount of audio data and has strong capabilities in speech recognition and translation.
- High-Quality Audio-Image Dataset: We carefully collected a high-quality audio-image dataset that covers common scenes and document images. The dataset is used to train audio question answering over images, improving the model's performance on that task.
Model Card
Name | InternOmni | |
---|---|---|
Resolution | 448 × 448 | |
Stage-1 | Training Data | To align the audio modality, we train on approximately 26 million samples drawn from datasets such as GigaSpeech, CommonVoice, Libriheavy, and WenetSpeech. The format is audio + text => text. At this stage, we freeze the ViT and its MLP and keep only the audio-related components active. |
| Trainable Module | MLP_audio |
| Training Cost | 64 GPUs, 4k steps, approximately 30 hours. |
Stage-2 | Training Data | We train on approximately 1.9 million samples from open-source image-text instruction datasets, replacing the original text with audio. These datasets include TextVQA, GQA, OKVQA, ALLAVA, and others. The format is audio + image => text. At this stage, we freeze both the ViT and Whisper encoders and keep only the MLP layers used for alignment active. |
| Trainable Module | MLP_audio |
| Training Cost | 32 GPUs, 3k steps, approximately 15 hours. |
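The two-stage recipe in the model card can be written down as a small configuration sketch. This is only an illustration: the module names (`vit`, `mlp_vision`, `whisper`, `mlp_audio`) and the `Stage` class are hypothetical, and the step counts are the rounded "4k" and "3k" figures from the table. Modules the model card does not mention are left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    inputs: tuple     # modalities fed to the model
    trainable: tuple  # modules listed under "Trainable Module"
    frozen: tuple     # modules the model card explicitly says are frozen
    gpus: int
    steps: int        # rounded figures from the model card
    hours: float      # approximate wall-clock time

    @property
    def gpu_hours(self) -> float:
        # Wall-clock hours multiplied by the number of GPUs.
        return self.gpus * self.hours


# Hypothetical module names; values copied from the model card.
STAGES = (
    Stage("stage-1", ("audio", "text"), ("mlp_audio",),
          ("vit", "mlp_vision"), gpus=64, steps=4000, hours=30.0),
    Stage("stage-2", ("audio", "image"), ("mlp_audio",),
          ("vit", "whisper"), gpus=32, steps=3000, hours=15.0),
)


def is_trainable(stage: Stage, module: str) -> bool:
    """Return True only for modules listed as trainable in this stage."""
    return module in stage.trainable


for stage in STAGES:
    print(stage.name, "+".join(stage.inputs), "=> text,",
          f"{stage.gpu_hours:.0f} GPU-hours")
print("total GPU-hours:", sum(s.gpu_hours for s in STAGES))
```

Under these figures, Stage-1 accounts for roughly 64 × 30 = 1920 GPU-hours and Stage-2 for roughly 32 × 15 = 480 GPU-hours.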
Performance
InternOmni did not use vision-language (VL) data in either the alignment or the SFT stage; it was trained entirely with audio data. Nevertheless, it retains InternVL's strong capabilities in handling complex image-text data, excelling in tasks such as scientific charts, general charts, documents, infographics, and OCR.
name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) |
---|---|---|---|---|---|---|---|---|---|
GPT-4V* (20240409) | 63.1 / 61.7 | 58.1 | 89.4 | 78.1 | 87.2 | - | 678 | 81.0 | 80.2 |
Gemini Pro 1.5* | 58.5 / 60.6 | 57.7 | 80.3 | 81.3 | 86.5 | 72.7 | 754 | 73.9 | 73.8 |
Claude3.5-Sonnet* | 68.3 / 65.9 | 67.7 | 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 |
GPT-4o* (20240513) | 69.1 / 69.2 | 63.8 | 94.2 | 85.7 | 92.8 | - | 736 | 83.4 | 82.1 |
Cambrian-1 | 49.7 / 50.4 | 53.2 | 79.7 | 75.6 | 75.5 | - | 600 | 81.4 | - |
LLaVA-NeXT Qwen1.5 | 50.1 | 49.0 | 80.4 | 79.7 | 85.7 | - | - | 80.5 | - |
| 58.9 / 62.0 | 66.3 | 87.3 / 96.0 | 87.1 | 95.1 | 83.3 | 837 | 87.8 | 87.2 |
InternVL2-8B | 49.3 / 51.2 | 58.3 | 83.8 | 83.3 | 91.6 | 74.8 | 794 | 81.7 | 81.2 |
| 49.3 / 51.2 | 58.3 | 83.8 | 83.3 | 91.6 | 74.8 | 794 | 81.7 | 81.2 |
name | DocVQA_audio (val) | AI2D_audio (test) | Chartvqa_audio (human) | Chartvqa_audio (augment) | Textvqa_audio (val) | Infovqa_audio (val) |
---|---|---|---|---|---|---|
InternOmni | 79.94 | 53.92 | 56.08 | 76.48 | 69.07 | 60.34 |
- We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for AI2D, ChartQA, DocVQA, InfoVQA, and MMBench were obtained with the InternVL repository, while MathVista and OCRBench were evaluated with VLMEvalKit.
- For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
- Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
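When working with the table programmatically, the dual MMMU cells ("left / right") need to be split into their two sources. The following is a minimal sketch, assuming the cell text appears exactly as in the table above; the `Score` type and the `parse_mmmu_cell` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Score(NamedTuple):
    original: Optional[float]    # left side: InternVL codebase or reported score
    vlmevalkit: Optional[float]  # right side: VLMEvalKit / OpenCompass score


def parse_mmmu_cell(cell: str) -> Score:
    """Split an MMMU cell such as '49.3 / 51.2' into its two scores.

    A cell with a single number is treated as the left-side score only,
    and '-' is treated as a missing value.
    """
    parts = [p.strip() for p in cell.split("/")]
    values = [None if p in ("", "-") else float(p) for p in parts]
    if len(values) == 1:
        return Score(values[0], None)
    if len(values) == 2:
        return Score(values[0], values[1])
    raise ValueError(f"unexpected MMMU cell: {cell!r}")


score = parse_mmmu_cell("49.3 / 51.2")  # InternVL2-8B row
gap = round(score.vlmevalkit - score.original, 1)
print(score, "toolkit gap:", gap)
```

For the InternVL2-8B row, the two toolkits differ by about 1.9 points, which gives a sense of the size of the discrepancies described in the notes above.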
Benchmark
Existing audio benchmarks mainly focus on the audio itself, with relatively simple questions and images. To better evaluate the model's ability to handle complex audio-image problems, especially those involving tables and mathematical knowledge, we took the text questions from several existing complex image-text VQA datasets and converted them into audio files. From these, we selected the more challenging samples of the original image-text questions, creating a 15k audio-image question-answer dataset. The benchmark will be open-sourced soon.
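The construction described above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the procedure actually used: the text-to-speech step and the difficulty filter are passed in as callables because the page does not specify either, and `fake_tts` and `long_question` below are placeholder stand-ins. The sketch also filters before synthesizing, simply to avoid generating audio that would be discarded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class VQAItem:
    image: str     # path to the image
    question: str  # original text question
    answer: str


@dataclass
class AudioVQAItem:
    image: str
    audio: str     # path to the spoken question
    answer: str


def build_audio_benchmark(
    items: Iterable[VQAItem],
    synthesize: Callable[[str, str], str],     # (text, filename) -> audio path
    is_challenging: Callable[[VQAItem], bool],
) -> List[AudioVQAItem]:
    """Keep the challenging items and replace each text question with audio."""
    out = []
    for index, item in enumerate(items):
        if not is_challenging(item):
            continue
        audio_path = synthesize(item.question, f"q{index:06d}.wav")
        out.append(AudioVQAItem(item.image, audio_path, item.answer))
    return out


# Placeholder stand-ins (assumptions): a TTS stub that only returns a path,
# and an arbitrary difficulty rule based on question length.
def fake_tts(text: str, filename: str) -> str:
    return f"audio/{filename}"


def long_question(item: VQAItem) -> bool:
    return len(item.question.split()) >= 8


samples = [
    VQAItem("img/chart1.png", "What is the sum of the two highest bars in 2019?", "42"),
    VQAItem("img/doc1.png", "Who signed?", "J. Doe"),
]
benchmark = build_audio_benchmark(samples, fake_tts, long_question)
print(benchmark)
```

In practice, `synthesize` would wrap a real text-to-speech system and `is_challenging` would encode whatever selection rule is used for the benchmark.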
Data
Type | Size | Download |
---|---|---|
Pretrain Data | 26M | 🤗 HF link |
SFT Data | 1.9M | TBD |
Benchmark | 15k | 🤗 HF link |
Examples
Citation
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}
@inproceedings{chen2024internvl,
title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={24185--24198},
year={2024}
}