

We introduce InternVL2, currently the most powerful open-source Multimodal Large Language Model (MLLM). The InternVL2 family includes models ranging from a 1B model, suitable for edge devices, to a 108B model, which is significantly more powerful. With larger-scale language models, InternVL2-Pro demonstrates outstanding multimodal understanding capabilities, matching the performance of commercial closed-source models across various benchmarks.

The InternVL2 family is built upon the following designs:

  1. Progressive alignment with larger language models: We introduce a progressive alignment training strategy, resulting in the first vision foundation model natively aligned with large language models. By employing a progressive training strategy in which the model scales from small to large while the data is refined from coarse to fine, we completed the training of large models at relatively low cost. This approach has demonstrated excellent performance with limited resources.
  2. Multimodal input: With one set of parameters, our model supports multiple modalities of input, including text, images, video, and medical data.
  3. Multitask output: Powered by our recent work VisionLLMv2, our model supports various output formats, such as images, bounding boxes, and masks, demonstrating extensive versatility. By connecting the MLLM with multiple downstream task decoders, InternVL2 can be generalized to hundreds of vision-language tasks while achieving performance comparable to expert models (a minimal sketch of this layout follows below).
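The multimodal-input and multitask-output designs above can be pictured as a ViT encoder whose tokens are projected by an MLP into the LLM's embedding space, with optional task-specific decoders reading the LLM's hidden states. The following PyTorch sketch illustrates that layout only; all class and attribute names are illustrative placeholders, not the actual InternVL2 or VisionLLMv2 implementation.

```python
import torch
import torch.nn as nn
from typing import Dict, Optional


class InternVLStyleModel(nn.Module):
    """Illustrative ViT -> MLP projector -> LLM layout with optional task decoders."""

    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int,
                 task_decoders: Optional[Dict[str, nn.Module]] = None):
        super().__init__()
        self.vit = vit                      # vision encoder, returns patch tokens [B, N, vit_dim]
        self.mlp = nn.Sequential(           # projector aligning vision tokens to the LLM space
            nn.LayerNorm(vit_dim),
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                      # language model, returns hidden states [B, T, llm_dim]
        # Optional downstream decoders (boxes, masks, images, ...) for multitask output.
        self.task_decoders = nn.ModuleDict(task_decoders or {})

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor,
                task: Optional[str] = None) -> torch.Tensor:
        vision_tokens = self.mlp(self.vit(pixel_values))         # project image tokens
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)  # prepend them to the text
        hidden = self.llm(inputs)
        if task is not None:
            return self.task_decoders[task](hidden)              # e.g. box or mask decoder
        return hidden
```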

Model Card

| Name | InternVL2-2B | InternVL2-4B | InternVL2-8B | InternVL2-26B | InternVL2-40B | InternVL2-108B |
| --- | --- | --- | --- | --- | --- | --- |
| Model Size (Total) | 2.21B | 4.15B | 8.08B | 25.51B | 40.07B | 108.70B |
| Model Size (ViT) | 304.01M | 304.01M | 304.01M | 5.54B | 5.54B | 5.54B |
| Model Size (MLP) | 12.60M | 22.03M | 33.57M | 116.43M | 143.17M | 172.01M |
| Model Size (LLM) | 2.21B | 3.82B | 7.74B | 19.86B | 34.39B | 102.99B |

Resolution (all models): dynamic resolution, up to 12 tiles of 448 × 448 during training and up to 40 tiles (4K resolution) during testing.
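The dynamic-resolution setting above can be realized by choosing a tile grid that best matches each image's aspect ratio, then cutting the resized image into 448 × 448 tiles. The sketch below mirrors that idea under simple assumptions; it is a simplified illustration, not InternVL2's exact preprocessing code.

```python
from PIL import Image


def dynamic_tile(image: Image.Image, tile_size: int = 448, max_tiles: int = 12):
    w, h = image.size
    aspect = w / h
    # Enumerate all (cols, rows) grids with at most `max_tiles` tiles and pick
    # the one whose aspect ratio is closest to the input image.
    grids = [(c, r) for c in range(1, max_tiles + 1)
             for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_size, r * tile_size,
                   (c + 1) * tile_size, (r + 1) * tile_size)
            tiles.append(resized.crop(box))
    return tiles  # each tile is a 448x448 image fed to the vision encoder


# Example: up to 12 tiles during training, up to 40 tiles (4K input) at test time.
# tiles = dynamic_tile(Image.open("example.jpg"), max_tiles=40)
```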
Stage-1 Training Data: We extend the pre-training dataset used in InternVL 1.5 with data collected from diverse sources. These datasets span multiple tasks, including captioning, visual question answering, detection, grounding, and OCR. The OCR datasets were constructed by running PaddleOCR on Chinese images from Wukong and English images from LaionCOCO, and were manually verified. In addition, we crawled and manually parsed exam data from uworld, kaptest, testbank, aga, and sat, and also utilized the interleaved data from OmniCorpus.
Stage-1 Trainable Module: MLP

Stage-2 Training Data: We constructed the training data based on the 5M high-quality bilingual dataset used in InternVL 1.5. Specifically, we included video data such as EgoTaskQA, Mementos, STAR, NTU RGB+D, VideoChat2IT, and LSMDC-QA, as well as medical data such as Medical-Diff-VQA, Pathology-VQA, PMC-CaseReport, PMC-VQA, Slake, and VQA-RAD. We also included SROIE, FUNSD, and POIE to further enhance the model's ability to recognize handwritten text. Additionally, we excluded all data from ShareGPT-4V and replaced it with data from ShareGPT-4o.
Stage-2 Trainable Module: ViT + MLP + LLM
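The two trainable-module settings above translate directly into which parameter groups receive gradients in each stage: only the MLP projector in Stage-1, and ViT + MLP + LLM jointly in Stage-2. Below is a minimal sketch of that freeze/unfreeze logic; the `model` attributes and learning rates are placeholders for illustration, not the actual training configuration.

```python
import torch


def set_trainable(model, train_vit: bool, train_mlp: bool, train_llm: bool):
    # Assumes a model with .vit / .mlp / .llm submodules, as in the sketch above.
    for p in model.vit.parameters():
        p.requires_grad = train_vit
    for p in model.mlp.parameters():
        p.requires_grad = train_mlp
    for p in model.llm.parameters():
        p.requires_grad = train_llm


def make_optimizer(model, lr: float):
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)


# Stage 1: align vision features to the LLM with the projector only.
# set_trainable(model, train_vit=False, train_mlp=True, train_llm=False)
# optimizer = make_optimizer(model, lr=1e-4)   # placeholder learning rate

# Stage 2: full fine-tuning on the bilingual instruction data.
# set_trainable(model, train_vit=True, train_mlp=True, train_llm=True)
# optimizer = make_optimizer(model, lr=2e-5)   # placeholder learning rate
```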

Performance

InternVL2 demonstrates powerful capabilities in handling complex multimodal data, excelling in tasks involving mathematics, scientific charts, general charts, documents, infographics, and OCR. For instance, InternVL2-Pro achieves an accuracy of 66.3% on the MathVista benchmark, surpassing most closed-source commercial models and open-source models. Moreover, InternVL2 achieves state-of-the-art performance across a wide range of benchmarks, including the general chart benchmark ChartQA, the document benchmark DocVQA, the infographic benchmark InfographicVQA, and the general visual question answering benchmark MMBench.

Notably, there are two evaluation settings for the AI2D benchmark. In the first setting, we replace the content within the rectangles in the images with the letters of the options; in the second, we replace it with both the letters and the values of the options. Our model scores 87.3 in the first setting and 96.0 in the second.
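To make the two AI2D settings concrete, the sketch below overwrites each annotated rectangle either with its option letter alone (setting 1) or with the letter plus the option value (setting 2). The annotation format used here (a list of dicts with `box`, `letter`, and `value` keys) is an assumption for illustration only, not the actual evaluation code.

```python
from PIL import Image, ImageDraw


def mask_ai2d_boxes(image: Image.Image, annotations, with_value: bool = False):
    img = image.copy()
    draw = ImageDraw.Draw(img)
    for ann in annotations:
        x0, y0, x1, y1 = ann["box"]
        # Blank out the original content of the rectangle.
        draw.rectangle((x0, y0, x1, y1), fill="white", outline="black")
        # Setting 1: letter only; Setting 2: letter plus option value.
        text = ann["letter"] if not with_value else f'{ann["letter"]}. {ann["value"]}'
        draw.text((x0 + 2, y0 + 2), text, fill="black")
    return img


# masked_v1 = mask_ai2d_boxes(img, anns, with_value=False)  # first setting
# masked_v2 = mask_ai2d_boxes(img, anns, with_value=True)   # second setting
```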

* Proprietary Model

| name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) | OpenCompass (avg score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V* (20240409) | 63.1 / 61.7 | 58.1 | 89.4 | 78.1 | 87.2 | - | 678 | 81.0 | 80.2 | 63.5 |
| Gemini Pro 1.5* | 58.5 / 60.6 | 57.7 | 80.3 | 81.3 | 86.5 | 72.7 | 754 | 73.9 | 73.8 | 64.4 |
| Claude3.5-Sonnet* | 68.3 / 65.9 | 67.7 | 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 | 67.9 |
| GPT-4o* (20240513) | 69.1 / 69.2 | 63.8 | 94.2 | 85.7 | 92.8 | - | 736 | 83.4 | 82.1 | 69.9 |
| Cambrian-1 | 49.7 / 50.4 | 53.2 | 79.7 | 75.6 | 75.5 | - | 600 | 81.4 | - | 58.3 |
| LLaVA-NeXT Qwen1.5 | 50.1 | 49.0 | 80.4 | 79.7 | 85.7 | - | - | 80.5 | - | - |
| InternVL2-Pro | 58.9 / 62.0 | 66.3 | 87.3 / 96.0 | 87.1 | 95.1 | 83.3 | 837 | 87.8 | 87.2 | 71.8 |
| name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) | OpenCompass (avg score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2-1B | 35.4 / 36.7 | 37.7 | 64.1 | 72.9 | 81.7 | 50.9 | 754 | 65.4 | 60.7 | 48.3 |
| InternVL2-2B | 34.3 / 36.3 | 46.3 | 74.1 | 76.2 | 86.9 | 58.9 | 784 | 73.2 | 70.9 | 54.0 |
| InternVL2-4B | 47.0 / 48.3 | 58.6 | 78.9 | 81.5 | 89.2 | 67.0 | 788 | 78.6 | 73.9 | 60.6 |
| InternVL2-8B | 49.3 / 51.2 | 58.3 | 83.8 | 83.3 | 91.6 | 74.8 | 794 | 81.7 | 81.2 | 64.1 |
| InternVL2-26B | 48.3 / 50.7 | 59.4 | 84.5 | 84.9 | 92.9 | 75.9 | 825 | 83.4 | 82.0 | 66.4 |
| InternVL2-40B | 53.9 / 55.2 | 63.7 | 87.1 | 86.2 | 93.9 | 78.7 | 837 | 86.8 | 86.5 | 69.7 |
| InternVL2-Llama3-76B | 55.2 / 58.2 | 65.5 | 87.6 | 88.4 | 94.1 | 82.0 | 839 | 86.5 | 86.3 | 71.0 |
| InternVL2-Pro | 58.9 / 62.0 | 66.3 | 87.3 / 96.0 | 87.1 | 95.1 | 83.3 | 837 | 87.8 | 87.2 | 71.8 |
  • We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for AI2D, ChartQA, DocVQA, InfoVQA, and MMBench were obtained using the InternVL repository, while MathVista and OCRBench were evaluated using VLMEvalKit.
  • For MMMU, we report both the original scores (left: evaluated with the InternVL codebase for InternVL-series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right: collected from the OpenCompass leaderboard).
  • Please note that evaluating the same model with different testing toolkits, such as InternVL and VLMEvalKit, can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results. For a quick local sanity check of a released checkpoint, see the sketch below.
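Outside the evaluation toolkits above, the released InternVL2 checkpoints on HuggingFace can be loaded with `trust_remote_code` for a quick local check. The snippet below is a minimal sketch; refer to the corresponding model repository for the exact preprocessing and `chat` arguments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-8B"  # any released size follows the same pattern
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# `pixel_values` is the stack of 448x448 tiles produced by the dynamic-resolution
# preprocessing sketched earlier (shape [num_tiles, 3, 448, 448], bfloat16, on GPU).
# question = "<image>\nDescribe this chart."
# response = model.chat(tokenizer, pixel_values, question,
#                       generation_config=dict(max_new_tokens=512))
# print(response)
```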

In addition to the VQA benchmarks mentioned above, we also evaluate InternVL2-Pro on the MM-NIAH benchmark, a comprehensive benchmark designed for long multimodal document understanding. As shown in the figure below, our model with Retrieval-Augmented Generation (RAG) achieves performance comparable to Gemini in comprehending long multimodal documents. Improving performance on the counting task and other tasks involving image needles is left for future work. See this paper for more details on InternVL2-Pro augmented with RAG.

[Figure: MM-NIAH results of InternVL2-Pro with RAG compared with Gemini.]
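As a rough illustration of the retrieval-augmented setup mentioned above, one can embed the question and every chunk of a long multimodal document, keep only the top-k most similar chunks, and feed those to the MLLM instead of the full document. The chunking, embedding, and answering helpers referenced in the comments are assumptions for illustration, not the method described in the linked paper.

```python
import numpy as np


def retrieve_top_k(question_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 8):
    # Cosine similarity between the question and every chunk embedding.
    q = question_emb / np.linalg.norm(question_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k most relevant chunks


# chunks = split_document(long_doc)                        # text/image segments (assumed helper)
# idx = retrieve_top_k(embed(question), embed_all(chunks)) # assumed embedding helpers
# answer = mllm_answer(question, [chunks[i] for i in idx]) # assumed MLLM wrapper
```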

Examples

Citation


  @article{chen2023internvl,
    title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
    journal={arXiv preprint arXiv:2312.14238},
    year={2023}
  }
  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }
  
