| Type | Model | Date | Download | Note |
| --- | --- | --- | --- | --- |
| Multimodal Large Language Models | InternOmni | 2024.07.25 | 🤗 HF link | Extending InternVL's Modalities to Audio with Good Performance |
| Vision Foundation Model | InternViT-300M-448px | 2024.05.25 | 🤗 HF link | Distilled small vision foundation model with 300M parameters. |
| Audio Foundation Model | Whisper-large-v3 | 2024.07.25 | 🤗 HF link | Pre-trained model for ASR and speech translation |

InternOmni

Method

We introduce InternOmni, an open-source multimodal large language model that adds audio input to the existing InternVL series. The goal is to extend the modalities supported by the InternVL series and move further toward artificial general intelligence while giving users a better experience. InternOmni relies on the following designs:

  1. Strong Vision Encoder: Building on the previously used vision model InternViT-6B, we applied distillation to create a lightweight vision foundation model, InternViT-300M. This enhances visual understanding while reducing the model size.
  2. Efficient Audio Encoder: We adopted OpenAI's open-source Whisper-large-v3 model, which has been trained on a large amount of audio data and has strong capabilities in speech recognition and translation.
  3. High-Quality Audio-Image Dataset: We carefully collected a high-quality audio-image dataset that covers common scenes and document images. This dataset is used for audio question-answering training on images, improving the model's performance in audio question-answering tasks.
Model structure

Model Card

Name: InternOmni
Resolution: 448 × 448
Stage-1
  Training Data: To align the audio modality, we train on approximately 26 million samples drawn from datasets such as GigaSpeech, CommonVoice, Libriheavy, and WenetSpeech. The format is audio+text => text. At this stage we freeze the ViT and its MLP and keep only the audio-related components active.
  Trainable Module: MLP_audio
  Training Cost: 64 GPUs, 4k steps, approximately 30 hours.
Stage-2
  Training Data: We train on approximately 1.9 million samples from open-source image-text instruction datasets, with the original text replaced by audio. These datasets include TextVQA, GQA, OKVQA, ALLaVA, and others. The format is audio+image => text. At this stage we freeze both the ViT and Whisper encoders and keep only the MLP layers used for alignment active.
  Trainable Module: MLP_audio
  Training Cost: 32 GPUs, 3k steps, approximately 15 hours.
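
The two training stages can be summarized as a small freeze plan with a rough compute estimate. This is only a sketch: the module names `vit`, `mlp_vision`, `whisper`, and `llm` are hypothetical labels chosen for illustration (only `MLP_audio` is named in the model card), and the plan follows the "Trainable Module" rows literally, treating every other module as frozen, which is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical module labels; only "MLP_audio" is named in the model card.
ALL_MODULES: Tuple[str, ...] = ("vit", "mlp_vision", "whisper", "mlp_audio", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    data_format: str
    trainable: Tuple[str, ...]  # modules that receive gradient updates
    gpus: int
    hours: float  # approximate wall-clock hours from the model card

    def requires_grad(self) -> Dict[str, bool]:
        """Map every module to True (trained) or False (frozen)."""
        return {module: module in self.trainable for module in ALL_MODULES}

    def gpu_hours(self) -> float:
        """Rough compute budget: number of GPUs times wall-clock hours."""
        return self.gpus * self.hours


# Assumption: everything outside the "Trainable Module" row stays frozen.
STAGES = (
    Stage("stage-1", "audio+text => text", ("mlp_audio",), gpus=64, hours=30),
    Stage("stage-2", "audio+image => text", ("mlp_audio",), gpus=32, hours=15),
)

if __name__ == "__main__":
    for stage in STAGES:
        frozen = [m for m, on in stage.requires_grad().items() if not on]
        print(f"{stage.name} ({stage.data_format}): "
              f"train={list(stage.trainable)} frozen={frozen} "
              f"~{stage.gpu_hours():.0f} GPU-hours")
```

Under these figures, Stage-1 costs roughly 64 × 30 = 1920 GPU-hours and Stage-2 roughly 32 × 15 = 480 GPU-hours. In a real training script, the boolean map would typically be applied by toggling gradient tracking on each module's parameters.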

Performance

InternOmni did not use VL data in either the alignment or the SFT stage; it was trained entirely with audio data. Nevertheless, it retains InternVL's strong capabilities on complex image-text data, excelling in tasks such as scientific charts, general charts, documents, infographics, and OCR.

| Name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V* (20240409) | 63.1 / 61.7 | 58.1 | 89.4 | 78.1 | 87.2 | - | 678 | 81.0 | 80.2 |
| Gemini Pro 1.5* | 58.5 / 60.6 | 57.7 | 80.3 | 81.3 | 86.5 | 72.7 | 754 | 73.9 | 73.8 |
| Claude3.5-Sonnet* | 68.3 / 65.9 | 67.7 | 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 |
| GPT-4o* (20240513) | 69.1 / 69.2 | 63.8 | 94.2 | 85.7 | 92.8 | - | 736 | 83.4 | 82.1 |
| Cambrian-1 | 49.7 / 50.4 | 53.2 | 79.7 | 75.6 | 75.5 | - | 600 | 81.4 | - |
| LLaVA-NeXT Qwen1.5 | 50.1 | 49.0 | 80.4 | 79.7 | 85.7 | - | - | 80.5 | - |
| InternVL2-Pro | 58.9 / 62.0 | 66.3 | 87.3 / 96.0 | 87.1 | 95.1 | 83.3 | 837 | 87.8 | 87.2 |
| InternVL2-8B | 49.3 / 51.2 | 58.3 | 83.8 | 83.3 | 91.6 | 74.8 | 794 | 81.7 | 81.2 |
| InternOmni | 49.3 / 51.2 | 58.3 | 83.8 | 83.3 | 91.6 | 74.8 | 794 | 81.7 | 81.2 |

| Name | DocVQA_audio (val) | AI2D_audio (test) | Chartvqa_audio (human) | Chartvqa_audio (augment) | Textvqa_audio (val) | Infovqa_audio (val) |
| --- | --- | --- | --- | --- | --- | --- |
| InternOmni | 79.94 | 53.92 | 56.08 | 76.48 | 69.07 | 60.34 |

  • We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for AI2D, ChartQA, DocVQA, InfoVQA, and MMBench were obtained with the InternVL repository, while MathVista and OCRBench were evaluated with VLMEvalKit.
  • For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
  • Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
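
When reading the table programmatically, cells that hold two scores (such as the MMMU column) need to be split into their two sources. The helper below is a minimal sketch of that step; `Score` and `parse_cell` are hypothetical names, and treating a lone value as the "original" score is an assumption, since the page does not say which toolkit produced single-valued cells.

```python
from typing import NamedTuple, Optional


class Score(NamedTuple):
    original: Optional[float]    # left value: InternVL codebase or technical report
    vlmevalkit: Optional[float]  # right value: VLMEvalKit / OpenCompass leaderboard


def parse_cell(cell: str) -> Score:
    """Parse a table cell such as '58.9 / 62.0', '50.1', or '-'."""
    parts = [part.strip() for part in cell.split("/")]
    values = [None if part in ("", "-") else float(part) for part in parts]
    if len(values) == 1:
        # Assumption: a single number is reported as the original score.
        return Score(values[0], None)
    if len(values) == 2:
        return Score(values[0], values[1])
    raise ValueError(f"unexpected score cell: {cell!r}")


if __name__ == "__main__":
    for cell in ("49.3 / 51.2", "50.1", "-"):
        print(repr(cell), "->", parse_cell(cell))
```

The same parser also handles the one non-MMMU cell that carries two values (AI2D for InternVL2-Pro), although the page does not explain what the second number means there.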

Benchmark

Existing audio benchmarks mainly focus on the audio itself and pair it with relatively simple questions and images. To better evaluate the model's ability to handle complex audio-image problems, especially those involving tables and mathematical knowledge, we took the text portions of several existing complex image-text VQA datasets and converted them into audio files. From these, we selected the more challenging of the original image-text questions, producing a 15k-sample audio-image question-answer dataset. The benchmark will be open-sourced soon.

Data

| Type | Size | Download |
| --- | --- | --- |
| Pretrain Data | 26M | 🤗 HF link |
| SFT Data | 1.9M | TBD |
| Benchmark | 15k | 🤗 HF link |

Examples

Input:

[Audio transcript] Is it possible to make a left turn from this lane?

Input Image

InternOmni:

No, the lane you're currently in seems to be a straight-only lane. However, there is a left turn signal at the next intersection, indicating that you can make a left turn from your current position.

Input:

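[Audio transcript, translated from Chinese] Write a fable

Input Image

InternOmni:

Once upon a time there was a young man named Li Yun. He was diligent and eager to learn, but he never had a chance to show his talent and ability. One day he heard that a magical chicken lived on a nearby mountain, and that whoever obtained one of its feathers would gain success and wealth. So he decided to go looking for this magical chicken.

At the foot of the mountain, Li Yun came across a real rooster. It stood on a rock, seemingly resting or observing its surroundings. Li Yun tried to approach the rooster, but every time he got close, it let out a loud crow and flew away. Even so, Li Yun did not give up on his goal. He followed the rooster up the mountain until they reached the summit. There stood another large rock and a tree. When Li Yun walked up beside the rock, he saw a small bird sitting on it. The bird looked like a parrot, with bright yellow, green, and blue feathers. However, this was not the legendary magical bird, just another ordinary bird. Disappointed, Li Yun realized he had already come a long way, but he kept climbing the mountain. Eventually, on the other side of the mountain, he found a cave in a sheer cliff. There he discovered a mysterious treasure chest. The chest was locked, and the key was hanging on a nearby tree. Once opened, the box was full of gold coins and jewels! Li Yun was overjoyed, because he had finally realized his dream. He took all the money home and began a new life. From that day on he became one of the richest people in the area, and he came to understand that only through constant effort is success possible.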

Citation


  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }

  @inproceedings{chen2024internvl,
    title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={24185--24198},
    year={2024}
  }

