

We introduce InternVL3, an advanced multimodal large language model (MLLM) series with strong overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, and further extends its multimodal abilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. We also compare InternVL3 with the Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.

The InternVL3 family is built upon the following designs:

  1. Variable Visual Position Encoding: We integrate Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. This modification facilitates the handling of longer multimodal contexts without excessively extending the position window.
  2. Native Multimodal Pre-Training: We propose a Native Multimodal Pre-Training approach that consolidates language pre-training and multimodal alignment training into a single pre-training stage. Unlike conventional paradigms, where a language-only large model is first trained (typically with language pre-training followed by language post-training) and later adapted to accommodate additional modalities, our method performs integrated optimization by interleaving multimodal data (e.g., image-text, video-text, or interleaved image-text sequences) with large-scale textual corpora during the pre-training process. This unified training scheme allows the pre-trained model to learn both linguistic and multimodal capabilities simultaneously, ultimately enhancing its ability to handle vision-language tasks without introducing additional bridging modules or subsequent inter-model alignment procedures.
  3. Mixed Preference Optimization: During pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ Mixed Preference Optimization (MPO), which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance.
  4. Test-Time Scaling with VisualPRM: Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation.
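To make the V2PE increment scheme above concrete, here is a minimal sketch (our own illustration with an arbitrary increment of 0.25, not the released implementation) of assigning position indices when visual tokens advance the position counter by a fraction instead of 1:

```python
def assign_position_ids(token_types, delta=0.25):
    """Assign position indices where text tokens advance the counter by 1
    and visual tokens by a smaller increment `delta` (V2PE-style).

    token_types: sequence of "text" or "vision" markers.
    Returns one position index per token.
    """
    positions, pos = [], 0.0
    for kind in token_types:
        positions.append(pos)
        pos += 1.0 if kind == "text" else delta

    return positions

# A prompt with 4 text tokens, 8 visual tokens, and 2 more text tokens
# occupies far less of the position window than 14 full-step increments would.
seq = ["text"] * 4 + ["vision"] * 8 + ["text"] * 2
ids = assign_position_ids(seq, delta=0.25)
```

With `delta=0.25`, the 14-token sequence only reaches position 7.0 instead of 13.0, which is the mechanism that keeps long multimodal contexts inside the position window.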


The architecture of InternVL3 follows the same general framework as its predecessors, adhering to the "ViT-MLP-LLM" paradigm. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. In addition, we adopted a similar dynamic resolution strategy to InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data. Notably, in InternVL3, we integrate Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long-context understanding capabilities than its predecessors.
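The pixel unshuffle step can be sketched in pure Python (an illustrative toy on nested lists, assuming a 2x2 grouping factor; the actual model operates on ViT feature maps):

```python
def pixel_unshuffle(grid, r=2):
    """Fuse each r x r block of visual tokens into a single token by
    concatenating their feature vectors, shrinking the token count by r*r.

    grid: h x w nested list of feature vectors (plain lists).
    """
    h, w = len(grid), len(grid[0])
    assert h % r == 0 and w % r == 0, "grid must tile evenly"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            fused = []
            for di in range(r):
                for dj in range(r):
                    fused.extend(grid[i + di][j + dj])
            row.append(fused)
        out.append(row)
    return out

# A 32 x 32 map of 4-dim features (1024 tokens) becomes a 16 x 16 map of
# 16-dim tokens: one quarter of the original token count.
grid = [[[float(i), float(j), 0.0, 1.0] for j in range(32)] for i in range(32)]
merged = pixel_unshuffle(grid)
```

The channel dimension grows by the same factor the spatial token count shrinks, so no information is discarded; the LLM simply sees fewer, wider visual tokens.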

Model Card

Name InternVL3-1B InternVL3-2B InternVL3-8B InternVL3-9B InternVL3-14B InternVL3-38B InternVL3-78B
Model Size Total 938.19M 2.09B 7.94B 9.14B 15.12B 38.39B 78.41B
ViT 304.01M 304.01M 304.01M 304.01M 304.01M 5.54B 5.54B
MLP 4.48M 8.66M 27.54M 33.57M 47.20M 91.79M 172.01M
LLM 629.70M 1.78B 7.61B 8.80B 14.77B 32.76B 72.70B
Resolution: dynamic resolution, up to 36 tiles of 448 × 448 during training and up to 128 tiles during testing.
Native Multimodal Pre-Training
Training Data: The pre-training data used in InternVL3 can be broadly categorized into two types: multimodal data and pure language data. The multimodal data comprises a combination of existing high-quality datasets and newly collected real-world data. Specifically, we leverage the pre-training data from InternVL2.5, which covers a diverse range of domains such as image captioning, general question answering, mathematics, charts, optical character recognition (OCR), knowledge grounding, document understanding, multi-turn dialogue, and medical data. Although the overall data scale was not increased, the utility of this dataset was significantly improved by updating not only the weights of the MLP module but also those of the Vision Transformer (ViT) and large language model (LLM) components. In addition, to enhance the model's ability to generalize to practical applications, we supplement this with newly collected data from real-world tasks, including graphical user interface (GUI) tasks, tool usage, 3D scene understanding, and video comprehension.
To compensate for the relatively short and less diverse textual content typically found in multimodal datasets, we incorporate pure language data into the pre-training process. This helps preserve and enhance the model's capabilities in language understanding and generation. The language corpus is constructed based on the pre-training data of InternLM2.5 and is further enriched with several open-source text datasets to improve the model's performance on knowledge-intensive, mathematical, and reasoning tasks.
Trainable Modules: ViT + MLP + LLM
Supervised Fine-Tuning
Training Data: For SFT data, we construct the training corpora based on those used in InternVL2.5 while introducing additional tool usage, 3D scene understanding, GUI operation, scientific diagram, creative writing, and multimodal reasoning samples. As a result, the number of training samples grows from 16.3M in InternVL2.5 to 21.7M in InternVL3.
Trainable Modules: ViT + MLP + LLM
Mixed Preference Optimization
Training Data: For MPO data, we construct preference pairs based on the data pipeline and samples proposed in MMPR v1.2, which cover a wide range of domains, including general visual question answering (VQA), science, charts, mathematics, OCR, and documents. We use the SFT versions of InternVL3-8B, 38B, and 78B to generate rollouts. During the MPO phase, all models are trained on the same dataset, which comprises about 300K samples.
Trainable Modules: ViT + MLP + LLM

Performance

Multimodal Reasoning and Mathematics

Model MMMU MathVista MathVision MathVerse DynaMath WeMath LogicVista Overall
LLaVA-OV-0.5B 31.4 34.8 - - - - - -
InternVL2.5-1B 41.2 47.1 21.1 16.4 5.6 11.1 26.0 24.1
InternVL3-1B 43.4 45.8 18.8 18.7 5.8 13.4 29.8 25.1
w/ VisualPRM-Bo8 55.4 62.1 21.7 28.9 13.4 28.5 34.9 35.0
Aquila-VL-2B 46.9 59.1 17.9 17.4 5.0 15.9 30.6 27.5
Qwen2.5-VL-3B 51.2 61.2 21.9 31.2 13.2 22.9 40.3 34.6
Ovis-2B 45.6 64.1 17.7 29.4 10.0 9.9 34.7 30.2
Ovis-4B 49.0 69.6 21.5 38.5 18.0 16.9 35.3 35.5
InternVL2.5-2B 43.2 51.1 14.0 22.3 4.4 8.0 27.3 24.3
InternVL2.5-4B 51.8 64.1 18.4 27.7 15.2 21.2 34.2 33.2
InternVL3-2B 48.6 57.0 21.7 25.3 14.6 22.4 36.9 32.4
w/ VisualPRM-Bo8 57.8 70.5 26.6 36.7 21.4 38.5 40.5 41.7
LLaVA-OV-7B 47.9 58.6 18.3 19.3 9.0 20.9 33.3 29.6
MiniCPM-V2.6 49.8 60.8 23.4 18.9 9.8 16.4 27.5 29.5
MiniCPM-o2.6 50.9 73.3 21.7 35.0 10.4 25.2 36.0 36.1
Ovis-8B 57.4 71.8 25.9 42.3 20.4 27.2 39.4 40.6
Qwen2.5-VL-8B 55.0 67.8 25.4 41.1 21.0 35.2 44.1 41.4
InternVL2.5-8B 56.2 64.5 17.0 22.8 9.4 23.5 36.0 32.8
InternVL3-8B 62.7 71.6 29.3 39.8 25.5 37.1 44.1 44.3
w/ VisualPRM-Bo8 66.0 75.2 37.5 46.3 28.5 48.1 49.7 50.2
InternVL3-9B 57.7 71.5 27.6 35.3 26.7 33.8 49.2 43.1
w/ VisualPRM-Bo8 63.7 76.2 33.9 45.8 29.1 46.6 50.6 49.4
Ovis2-16B 60.7 73.7 30.1 45.8 26.3 45.0 47.4 47.0
InternVL2.5-26B 60.7 68.2 23.4 24.0 11.4 30.9 39.6 36.9
InternVL3-14B 67.1 75.1 37.2 44.4 31.3 43.0 51.2 49.9
w/ VisualPRM-Bo8 69.3 77.9 40.1 47.7 33.1 52.0 56.2 53.8
Cambrian-34B 49.7 53.2 - - - - - -
VILA-1.5-40B 55.1 49.5 - - - - - -
Ovis2-34B 66.7 76.1 31.9 50.1 27.5 51.9 49.9 50.6
InternVL2.5-38B 63.9 71.9 32.2 36.9 20.0 38.3 47.9 44.4
InternVL3-38B 70.1 75.1 34.2 48.2 35.3 48.6 58.4 52.8
w/ VisualPRM-Bo8 71.0 79.4 41.8 54.2 36.1 55.2 58.4 56.6
GPT-4o-20241120 70.7 60.0 31.2 40.6 34.5 45.8 52.8 47.9
Claude-3.7-Sonnet 75.0 66.8 41.9 46.7 39.7 49.3 58.2 53.9
Gemini-2.0-Flash 72.6 70.4 43.6 47.8 42.1 47.4 52.3 53.7
Gemini-2.0-Pro 69.9 71.3 48.1 67.3 43.3 56.5 53.2 58.5
LLaVA-OV-72B 55.7 67.1 25.3 27.2 15.6 32.0 40.9 37.7
QvQ-72B-Preview 70.3 70.3 34.9 48.2 30.7 39.0 58.2 50.2
Qwen2.5-VL-72B 68.2 74.2 39.3 47.3 35.9 49.1 55.7 52.8
InternVL2.5-78B 70.0 72.3 32.2 39.2 19.2 39.8 49.0 46.0
InternVL3-78B 72.2 79.0 43.1 51.0 35.1 46.1 55.9 54.6
w/ VisualPRM-Bo8 72.2 80.5 40.8 54.2 37.3 52.4 57.9 56.5
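The "w/ VisualPRM-Bo8" rows above report Best-of-8 sampling with VisualPRM-8B as the critic. A minimal sketch of the Best-of-N selection loop, with toy stand-ins for the policy and the critic (hypothetical names, for illustration only):

```python
import random

def best_of_n(prompt, generate, critic, n=8, seed=0):
    """Sample n candidate responses and return the one the critic scores
    highest (Best-of-N selection with an external critic model)."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scores = [critic(prompt, c) for c in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

# Toy stand-ins; a real setup would call the MLLM and VisualPRM-8B here.
def toy_generate(prompt, rng):
    return f"{prompt} -> answer {rng.randint(0, 9)}"

def toy_critic(prompt, response):
    return int(response.split()[-1])  # toy rule: prefer larger final answers

best = best_of_n("2+2", toy_generate, toy_critic, n=8)
```

In the actual evaluation the critic is a process reward model that scores reasoning steps, but the selection rule is the same argmax over candidate scores.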

OCR, Chart, and Document Understanding

Model Name AI2D (w./wo Mask) ChartQA (test avg.) TextVQA (val) DocVQA (test) InfoVQA (test) OCRBench SEED-2 Plus CharXiv (RQ/DQ) VCR-EN-Easy (EM/Jaccard) Overall
LLaVA-OneVision-0.5B 57.1 / - 61.4 - 70.0 41.8 565 - - - -
InternVL2-1B 64.1 / 70.5 72.9 70.5 81.7 50.9 754 54.3 18.1 / 30.7 21.5 / 48.4 54.9
InternVL2.5-1B 69.3 / 77.8 75.9 72.0 84.8 56.0 785 59.0 19.0 / 38.4 91.5 / 97.0 68.3
InternVL3-1B 69.4 / 78.3 75.3 74.1 81.9 53.7 790 58.2 21.0 / 47.1 89.3 / 96.2 68.6
Qwen2-VL-2B 74.7 / 84.6 73.5 79.7 90.1 65.5 809 62.4 - 81.5 / - -
Qwen2.5-VL-3B 81.6 / - 84.0 79.3 93.9 77.1 797 67.6 31.3 / 58.6 - -
Aquila-VL-2B 75.0 / - 76.5 76.4 85.0 58.3 772 63.0 - 70.0 / - -
InternVL2-2B 74.1 / 82.3 76.2 73.4 86.9 58.9 784 60.0 21.0 / 40.6 32.9 / 59.2 62.0
InternVL2.5-2B 74.9 / 83.5 79.2 74.3 88.7 60.9 804 60.9 21.3 / 49.7 93.2 / 97.6 72.1
InternVL3-2B 78.7 / 87.4 80.2 77.0 88.3 66.1 835 64.6 28.3 / 54.7 91.2 / 96.9 74.7
Ovis1.6-Gemma2-9B 84.4 / - - - - - 830 - - - -
MiniCPM-V2.6 82.1 / - 82.4 80.1 90.8 - 852 65.7 31.0 / 57.1 73.9 / 85.7 -
Molmo-7B-D - / 93.2 84.1 81.7 92.2 72.6 694 - - - -
Qwen2-VL-7B 83.0 / 92.1 83.0 84.3 94.5 76.5 866 69.0 - 89.7 / 93.8 -
Qwen2.5-VL-7B 83.9 / - 87.3 84.9 95.7 82.6 864 70.4 42.5 / 73.9 - -
InternVL2-8B 83.8 / 91.7 83.3 77.4 91.6 74.8 794 67.5 31.2 / 56.1 37.9 / 61.5 69.7
InternVL2.5-8B 84.5 / 92.8 84.8 79.1 93.0 77.6 822 69.7 32.9 / 68.6 92.6 / 97.4 79.6
InternVL3-8B 85.2 / 92.6 86.6 80.2 92.7 76.8 880 69.7 37.6 / 73.6 94.5 / 98.1 81.3
InternVL3-9B 84.6 / 92.9 86.2 79.4 93.6 79.6 877 68.8 38.0 / 72.5 94.2 / 97.9 81.3
InternVL3-14B 86.0 / 93.7 87.3 80.5 94.1 83.6 875 70.3 43.1 / 82.2 94.8 / 98.2 83.4
InternVL-Chat-V1.5 80.7 / 89.8 83.8 80.6 90.9 72.5 724 66.3 29.2 / 58.5 14.7 / 51.4 65.9
InternVL2-26B 84.5 / 92.5 84.9 82.3 92.9 75.9 825 67.6 33.4 / 62.4 74.5 / 86.7 76.7
InternVL2.5-26B 86.4 / 94.4 87.2 82.4 94.0 79.8 852 70.8 35.9 / 73.5 94.4 / 98.0 81.8
Qwen2.5-VL-32B - - - 94.8 83.4 - - - - -
Cambrian-34B 79.5 / - 75.6 76.7 75.5 46.0 600 - 27.3 / 59.7 79.7 / 89.3 -
VILA-1.5-40B 69.9 / - 67.2 73.6 - - 460 - 24.0 / 38.7 - -
InternVL2-40B 86.6 / 94.5 86.2 83.0 93.9 78.7 837 69.2 32.3 / 66.0 84.7 / 92.6 79.3
InternVL2.5-38B 87.6 / 95.1 88.2 82.7 95.3 83.6 842 71.2 42.4 / 79.6 94.7 / 98.2 83.6
InternVL3-38B 88.9 / 95.5 89.2 83.9 95.4 85.0 886 71.6 46.4 / 87.2 96.1 / 98.7 85.5
GPT-4V 78.2 / 89.4 78.5 78.0 88.4 75.1 645 53.8 37.1 / 79.9 52.0 / 65.4 70.0
GPT-4o-20240513 84.6 / 94.2 85.7 77.4 92.8 79.2 736 72.0 47.1 / 84.5 91.6 / 96.4 81.6
Claude-3-Opus 70.6 / 88.1 80.8 67.5 89.3 55.6 694 44.2 30.2 / 71.6 62.0 / 77.7 67.3
Claude-3.5-Sonnet 81.2 / 94.7 90.8 74.1 95.2 74.3 788 71.7 60.2 / 84.3 63.9 / 74.7 78.7
Gemini-1.5-Pro 79.1 / 94.4 87.2 78.8 93.1 81.0 754 - 43.3 / 72.0 62.7 / 77.7 -
LLaVA-OneVision-72B 85.6 / - 83.7 80.5 91.3 74.9 741 - - - -
NVLM-D-72B 85.2 / 94.2 86.0 82.1 92.6 - 853 - - - -
Molmo-72B - / 96.3 87.3 83.1 93.5 81.9 - - - - -
Qwen2-VL-72B 88.1 / - 88.3 85.5 96.5 84.5 877 - - 91.3 / 94.6 -
Qwen2.5-VL-72B 88.7 / - 89.5 83.5 96.4 87.3 885 73.0 49.7 / 87.4 - -
InternVL2-Llama3-76B 87.6 / 94.8 88.4 84.4 94.1 82.0 839 69.7 38.9 / 75.2 83.2 / 91.3 81.1
InternVL2.5-78B 89.1 / 95.7 88.3 83.4 95.1 84.1 854 71.3 42.4 / 82.3 95.7 / 94.5 83.9
InternVL3-78B 89.7 / 96.0 89.7 84.3 95.4 86.5 906 71.9 46.0 / 85.1 96.0 / 98.6 85.8

Multi-Image & Real-World Comprehension

Model Name BLINK (val) Mantis Eval MMIU MuirBench MMT (val) MIRB (avg) Overall
LLaVA-OneVision-0.5B 52.1 39.6 - 25.5 - - -
InternVL2-1B 38.6 46.1 37.3 29.3 49.5 31.5 38.7
InternVL2.5-1B 42.0 51.2 38.5 29.9 50.3 35.6 41.3
InternVL3-1B 42.9 50.2 39.3 31.2 52.9 36.1 42.1
Qwen2-VL-2B 44.4 - - - 55.1 - -
Qwen2.5-VL-3B 47.6 - - 47.7 - - -
InternVL2-2B 43.8 48.4 39.8 32.5 50.4 32.1 41.2
InternVL2.5-2B 44.0 54.8 43.5 40.6 54.5 36.4 45.6
InternVL3-2B 50.3 65.9 43.0 38.8 59.5 42.9 50.1
Qwen2-VL-7B 53.2 - - - 64.0 - -
Qwen2.5-VL-7B 56.4 - - 59.6 - - -
MiniCPM-V2.6 53.0 69.0 - - 60.8 - -
InternVL2-8B 50.9 65.4 42.0 48.7 60.0 50.0 52.8
InternVL2.5-8B 54.8 67.7 46.7 51.1 62.3 52.5 55.9
InternVL3-8B 55.5 70.1 46.8 55.0 65.0 56.8 58.2
InternVL3-9B 58.6 70.1 50.4 51.4 65.4 58.6 59.1
InternVL3-14B 60.3 76.0 50.9 56.2 70.3 59.3 62.2
InternVL-Chat-V1.5 46.6 66.8 37.4 38.5 58.0 50.3 49.6
InternVL2-26B 56.2 69.6 42.6 50.6 60.6 53.7 55.6
InternVL2.5-26B 61.8 75.6 49.4 61.1 66.9 55.7 61.8
InternVL2-40B 57.2 71.4 47.9 54.4 66.2 55.2 58.7
InternVL2.5-38B 63.2 78.3 55.3 62.7 70.0 61.2 65.1
InternVL3-38B 64.0 77.9 57.4 63.8 71.8 62.3 66.2
GPT-4V 54.6 62.7 - 62.3 64.3 53.1 -
GPT-4o-20240513 68.0 - 55.7 68.0 65.4 - -
Claude-3.5-Sonnet - - 53.4 - - - -
Gemini-1.5-Pro - - 53.4 - 64.5 - -
LLaVA-OneVision-72B 55.4 77.6 - 54.8 - - -
Qwen2-VL-72B - - - - 71.8 - -
Qwen2.5-VL-72B 64.4 - - 70.7 - - -
InternVL2-Llama3-76B 56.8 73.7 44.2 51.2 67.4 58.2 58.6
InternVL2.5-78B 63.8 77.0 55.8 63.5 70.8 61.1 65.3
InternVL3-78B 66.3 79.3 60.4 64.5 73.2 64.3 68.0
Model Name RealWorldQA MME-RW (EN) WildVision (win rate) R-Bench (dis) Overall
LLaVA-OneVision-0.5B 55.6 - - - -
InternVL2-1B 50.3 40.2 17.8 55.6 41.0
InternVL2.5-1B 57.5 44.2 43.4 59.0 51.0
InternVL3-1B 58.2 46.0 43.8 60.4 52.1
Qwen2-VL-2B 62.6 - - - -
Qwen2.5-VL-3B 65.4 53.1 - - -
InternVL2-2B 57.3 47.3 31.8 56.8 48.3
InternVL2.5-2B 60.1 48.8 44.2 62.2 53.8
InternVL3-2B 64.3 53.8 48.8 67.5 58.6
Qwen2-VL-7B 70.1 56.5 - 64.0 -
Qwen2.5-VL-7B 68.5 57.4 - - -
MiniCPM-V2.6 65.0 - - - -
InternVL2-8B 64.4 53.5 54.4 67.9 60.1
InternVL2.5-8B 70.1 59.1 62.0 70.1 65.3
InternVL3-8B 70.8 62.0 69.8 74.1 69.2
InternVL3-9B 70.5 61.3 63.8 70.3 66.5
InternVL3-14B 70.7 64.0 69.8 69.3 68.5
InternVL-Chat-V1.5 66.0 49.4 56.6 67.9 60.0
InternVL2-26B 68.3 58.7 62.2 70.1 64.8
InternVL2.5-26B 74.5 61.8 65.2 72.9 68.6
Cambrian-34B 67.8 44.1 - - -
InternVL2-40B 71.8 61.8 63.2 73.3 67.5
InternVL2.5-38B 73.5 64.0 66.4 72.1 69.0
InternVL3-38B 75.6 67.3 71.6 73.3 72.0
GPT-4V 61.4 - 71.8 65.6 -
GPT-4o-20240513 75.4 45.2 80.6 77.7 69.7
Claude-3.5-Sonnet 60.1 51.6 - - -
Gemini-1.5-Pro 67.5 38.2 - - -
LLaVA-OneVision-72B 71.9 - - - -
Qwen2-VL-72B 77.8 - - - -
Qwen2.5-VL-72B 75.7 63.2 - - -
InternVL2-Llama3-76B 72.2 63.0 65.8 74.1 68.8
InternVL2.5-78B 78.7 62.9 71.4 77.2 72.6
InternVL3-78B 78.0 65.4 73.6 77.4 73.6

Comprehensive Multimodal & Hallucination Evaluation

Model Name MME (sum) MMB (EN/CN) MMBv1.1 (EN) MMVet (turbo) MMVetv2 (0613) MMStar Overall
LLaVA-OneVision-0.5B 1438.0 61.6 / 55.5 59.6 32.2 - 37.7 -
InternVL2-1B 1794.4 65.4 / 60.7 61.6 32.7 36.1 45.7 51.7
InternVL2.5-1B 1950.5 70.7 / 66.3 68.4 48.8 43.2 50.1 58.9
InternVL3-1B 1934.4 72.6 / 67.9 69.9 59.5 47.5 51.5 61.9
Qwen2-VL-2B 1872.0 74.9 / 73.5 72.2 49.5 - 48.0 -
Qwen2.5-VL-3B 2157 79.1 / 78.1 77.4 61.8 - 55.9 -
InternVL2-2B 1876.8 73.2 / 70.9 70.2 39.5 39.6 50.1 58.0
InternVL2.5-2B 2138.2 74.7 / 71.9 72.2 60.8 52.3 53.7 65.3
InternVL3-2B 2221.2 81.1 / 78.4 78.6 62.2 53.9 60.7 69.8
Qwen2-VL-7B 2326.8 83.0 / 80.5 80.7 62.0 - 60.7 -
Qwen2.5-VL-7B 2347 83.5 / 83.4 82.6 67.1 - 63.9 -
MiniCPM-V2.6 2348.4 81.5 / 79.3 78.0 60.0 - 57.5 -
InternVL2-8B 2210.3 81.7 / 81.2 79.5 54.2 52.3 62.0 69.2
InternVL2.5-8B 2344.1 84.6 / 82.6 83.2 62.8 58.1 62.8 73.2
InternVL3-8B 2415.4 83.4 / 82.2 81.7 81.3 66.3 68.2 77.7
InternVL3-9B 2372.8 83.4 / 82.2 81.7 76.2 65.4 66.3 76.3
InternVL3-14B 2478.3 85.6 / 84.1 83.5 80.2 68.4 68.8 79.0
InternVL-Chat-V1.5 2194.2 82.2 / 82.0 80.3 61.5 51.5 57.3 69.7
InternVL2-26B 2260.7 83.4 / 82.0 81.5 62.1 57.2 61.2 71.8
InternVL2.5-26B 2373.3 85.4 / 85.5 84.2 65.0 60.8 66.5 75.2
Cambrian-34B - 80.4 / 79.2 78.3 53.2 - 54.2 -
InternVL2-40B 2307.5 86.8 / 86.5 85.1 65.5 63.8 65.4 75.7
InternVL2.5-38B 2455.8 86.5 / 86.3 85.5 68.8 62.1 67.9 77.0
InternVL3-38B 2523.6 87.6 / 86.8 86.9 83.9 69.6 71.5 81.5
GPT-4V 1926.6 81.0 / 80.2 80.0 67.5 66.3 56.0 70.7
GPT-4o-20240513 - 83.4 / 82.1 83.1 69.1 71.0 64.7 -
Claude-3-Opus 1586.8 63.3 / 59.2 60.1 51.7 55.8 45.7 55.5
Claude-3.5-Sonnet - 82.6 / 83.5 80.9 70.1 71.8 65.1 -
Gemini-1.5-Pro - 73.9 / 73.8 74.6 64.0 66.9 59.1 -
LLaVA-OneVision-72B 2261.0 85.8 / 85.3 85.0 60.6 - 65.8 -
Qwen2-VL-72B 2482.7 86.5 / 86.6 85.9 74.0 66.9 68.3 78.7
Qwen2.5-VL-72B 2448.0 88.6 / 87.9 88.4 76.2 - 70.8 -
InternVL2-Llama3-76B 2414.7 86.5 / 86.3 85.5 65.7 68.4 67.4 77.2
InternVL2.5-78B 2494.5 88.3 / 88.5 87.4 72.3 65.5 69.5 79.2
InternVL3-78B 2549.8 89.0 / 88.7 87.7 81.3 70.0 72.5 82.0
Model Name HallBench (avg.) MMHal (score) CRPE (relation) POPE (avg.) Overall
LLaVA-OneVision-0.5B 27.9 - - - -
InternVL2-1B 34.0 2.25 57.5 87.3 45.3
InternVL2.5-1B 39.0 2.49 60.9 89.9 48.1
InternVL3-1B 41.4 2.59 64.0 90.7 49.7
Qwen2-VL-2B 41.7 - - - -
Qwen2.5-VL-3B 46.3 - 73.6 - -
InternVL2-2B 37.9 2.52 66.3 88.3 48.8
InternVL2.5-2B 42.6 2.94 70.2 90.6 51.6
InternVL3-2B 42.5 3.26 71.5 89.6 51.7
Qwen2-VL-7B 50.6 3.40 74.4 88.1 54.1
Qwen2.5-VL-7B 52.9 - 76.4 - -
MiniCPM-V2.6 48.1 3.60 75.2 87.3 53.6
InternVL2-8B 45.2 3.33 75.8 86.9 52.8
InternVL2.5-8B 50.1 3.65 78.4 90.6 55.7
InternVL3-8B 49.9 3.61 76.3 91.1 55.2
InternVL3-9B 51.2 3.47 75.0 90.4 55.0
InternVL3-14B 55.1 3.49 77.3 90.2 56.5
InternVL-Chat-V1.5 50.3 3.11 75.4 88.4 54.3
InternVL2-26B 50.7 3.55 75.6 88.0 54.5
InternVL2.5-26B 55.0 3.70 79.1 90.6 57.1
Cambrian-34B 41.6 - - - -
InternVL2-40B 56.9 3.75 77.6 88.4 56.7
InternVL2.5-38B 56.8 3.71 78.3 90.7 57.4
InternVL3-38B 57.1 3.77 77.1 90.6 57.1
GPT-4V 46.5 - - - -
GPT-4o-20240513 55.0 4.00 76.6 86.9 55.6
Claude-3-Opus 37.8 - - - -
Claude-3.5-Sonnet 55.5 - - - -
Gemini-1.5-Pro 45.6 - - - -
LLaVA-OneVision-72B 49.0 - - - -
Qwen2-VL-72B 58.1 - - - -
Qwen2.5-VL-72B 55.2 - 79.2 - -
InternVL2-Llama3-76B 55.2 3.83 77.6 89.0 56.4
InternVL2.5-78B 57.4 3.89 78.8 90.8 57.7
InternVL3-78B 59.1 3.85 79.2 90.3 58.1

Visual Grounding

Model Name RefCOCO (val / test-A / test-B) RefCOCO+ (val / test-A / test-B) RefCOCOg (val / test) Overall
Grounding-DINO-L 90.6 93.2 88.2 82.8 89.0 75.9 86.1 87.0 86.6
UNINEXT-H 92.6 94.3 91.5 85.2 89.6 79.8 88.7 89.4 88.9
ONE-PEACE 92.6 94.2 89.3 88.8 92.2 83.2 89.2 89.3 89.8
Qwen2.5-VL-3B 89.1 91.7 84.0 82.4 88.0 74.1 85.2 85.7 85.0
InternVL3-1B 85.8 90.1 81.7 76.6 84.1 69.2 82.8 82.6 81.6
InternVL3-2B 89.8 92.6 86.4 84.0 89.2 76.5 87.6 87.2 86.7
Shikra-7B 87.0 90.6 80.2 81.6 87.4 72.1 82.3 82.2 82.9
Ferret-v2-13B 92.6 95.0 88.9 87.4 92.1 81.4 89.4 90.0 89.6
CogVLM-Grounding 92.8 94.8 89.0 88.7 92.9 83.4 89.8 90.8 90.3
MM1.5 - 92.5 86.7 - 88.7 77.8 - 87.1 -
Qwen2-VL-7B 91.7 93.6 87.3 85.8 90.5 79.5 87.3 87.8 87.9
Qwen2.5-VL-7B 90.0 92.5 85.4 84.2 89.1 76.9 87.2 87.2 86.6
TextHawk2 91.9 93.0 87.6 86.2 90.0 80.4 88.2 88.1 88.2
InternVL2-8B 87.1 91.1 80.7 79.8 87.9 71.4 82.7 82.7 82.9
InternVL2.5-8B 90.3 94.5 85.9 85.2 91.5 78.8 86.7 87.6 87.6
InternVL3-8B 92.5 94.6 88.0 88.2 92.5 81.8 89.6 90.0 89.6
InternVL3-9B 91.8 93.2 86.6 86.4 91.0 79.9 88.0 88.5 88.2
InternVL3-14B 92.0 94.4 87.8 87.4 92.1 81.5 88.6 89.3 89.1
Qwen2-VL-72B 93.2 95.3 90.7 90.1 93.8 85.6 89.9 90.4 91.1
Qwen2.5-VL-72B 92.7 94.6 89.7 88.9 92.2 83.7 89.9 90.3 90.3
InternVL2-Llama3-76B 92.2 94.8 88.4 88.8 93.1 82.8 89.5 90.3 90.0
InternVL2.5-78B 93.7 95.6 92.5 90.4 94.7 86.9 92.7 92.2 92.3
InternVL3-38B 93.2 95.1 90.2 89.8 93.2 85.2 91.4 91.5 91.2
InternVL3-78B 93.4 95.4 90.3 90.1 93.8 85.3 91.5 91.5 91.4

Multimodal Multilingual Understanding

Model Name MMMB (en / zh / pt / ar / tr / ru) Multilingual MMBench (en / zh / pt / ar / tr / ru) MTVQA (avg.) Overall
InternVL2-1B 73.2 67.4 55.5 53.5 43.8 55.2 67.9 61.2 50.8 43.3 31.8 52.7 12.6 40.7
InternVL2.5-1B 78.8 70.2 61.5 55.0 45.3 61.1 72.5 64.7 57.0 43.0 37.8 53.2 21.4 46.0
InternVL3-1B 79.4 70.1 62.3 58.0 47.6 61.9 72.6 66.2 62.3 48.0 39.5 60.3 22.2 47.9
Qwen2-VL-2B 78.3 74.2 72.6 68.3 61.8 72.8 72.1 71.1 69.9 61.1 54.4 69.3 20.0 52.6
Qwen2.5-VL-3B - - - - - - - - - - - - 24.8 -
InternVL2-2B 79.4 71.6 54.0 43.5 46.4 48.1 73.8 69.6 51.4 29.8 31.3 42.3 10.9 39.3
InternVL2.5-2B 81.4 74.4 58.2 48.3 46.4 53.2 76.5 71.6 55.9 37.3 33.9 44.8 21.8 45.2
InternVL3-2B 81.9 78.3 75.4 68.6 62.9 74.6 81.3 77.8 75.9 66.4 59.5 70.7 26.7 57.4
mPLUG-Owl2 67.3 61.0 59.7 45.8 45.4 62.6 66.2 59.4 58.2 37.9 47.7 60.4 - -
Qwen2-VL-7B 83.9 82.4 81.2 79.0 74.7 82.4 81.8 81.6 79.1 75.6 74.5 79.3 25.6 61.6
Qwen2.5-VL-7B - - - - - - - - - - - - 29.2 -
InternVL2-8B 83.4 81.5 76.1 66.3 69.2 75.7 82.9 81.8 76.0 60.5 66.0 74.4 20.9 56.6
InternVL2.5-8B 84.3 83.1 78.6 69.3 71.5 79.5 83.8 83.2 79.4 64.3 67.8 77.3 27.6 60.4
InternVL3-8B 85.1 83.1 82.5 81.6 76.2 83.4 85.5 85.6 83.2 79.2 75.9 82.6 30.2 64.7
InternVL3-9B 84.8 83.7 80.6 69.9 68.5 80.8 86.5 85.2 79.1 64.3 68.3 79.1 27.1 60.7
InternVL3-14B 85.7 84.7 83.1 83.7 79.3 83.6 86.7 85.8 83.2 81.1 80.7 83.8 31.6 66.2
InternVL-Chat-V1.5 82.6 80.8 76.3 65.2 68.6 74.0 81.1 80.2 76.9 56.2 66.7 71.0 20.5 55.7
InternVL2-26B 83.8 81.7 78.0 68.8 69.3 76.3 82.7 81.8 77.8 61.9 69.6 74.4 17.7 56.2
InternVL2.5-26B 86.2 83.8 81.6 73.3 73.7 82.8 86.1 85.5 80.7 67.5 75.0 79.6 28.5 62.6
InternVL2-40B 85.3 84.1 81.1 70.3 74.2 81.4 86.2 85.8 82.8 64.0 74.2 81.8 20.6 59.7
InternVL2.5-38B 86.4 85.1 84.1 84.3 82.8 84.9 87.5 88.6 85.3 84.5 84.0 85.9 31.7 67.4
InternVL3-38B 86.7 85.6 84.5 84.8 82.6 85.1 89.0 89.3 87.1 84.6 84.3 87.4 32.4 68.1
GPT-4V 75.0 74.2 71.5 73.5 69.0 73.1 77.6 74.4 72.5 72.3 70.5 74.8 22.0 56.1
GPT-4o - - - - - - - - - - - - 27.8 -
Gemini-1.0-Pro 75.0 71.9 70.6 69.9 69.6 72.7 73.6 72.1 70.3 61.1 69.8 70.5 - -
Qwen2-VL-72B 86.8 85.3 85.2 84.8 84.2 85.3 86.9 87.2 85.8 83.5 84.4 85.3 30.9 67.2
Qwen2.5-VL-72B - - - - - - - - - - - - 31.7 -
InternVL2-Llama3-76B 85.3 85.1 82.8 82.8 83.0 83.7 87.8 87.3 85.9 83.1 85.0 85.7 22.0 63.9
InternVL2.5-78B 86.3 85.6 85.1 84.8 83.1 85.4 90.0 89.7 87.4 83.3 84.9 86.3 31.9 68.0
InternVL3-78B 87.2 86.6 85.5 86.5 84.6 86.1 89.4 90.3 88.7 86.1 86.6 88.1 32.5 68.9

Video Understanding

Model Name Video-MME (wo/w. sub.) MVBench MMBench-Video MLVU (M-Avg) LongVideoBench (val total) CG-Bench (long/clue acc.) Overall
InternVL2-1B 42.9 / 45.4 57.5 1.14 51.6 43.3 - -
InternVL2.5-1B 50.3 / 52.3 64.3 1.36 57.3 47.9 - -
InternVL3-1B 51.0 / 53.0 63.1 1.3 53.0 48.1 24.8 / 39.1 46.9
Qwen2-VL-2B 55.6 / 60.4 63.2 - - - - -
Qwen2.5-VL-3B 61.5 / 67.6 67.0 1.63 68.2 43.3 - -
InternVL2-2B 46.2 / 49.1 60.2 1.30 54.3 46.0 - -
InternVL2.5-2B 51.9 / 54.1 68.8 1.44 61.4 52.0 - -
InternVL3-2B 58.9 / 61.4 70.4 1.42 64.2 55.4 30.8 / 50.7 54.9
VideoChat2-HD 45.3 / 55.7 62.3 1.22 47.9 - - -
MiniCPM-V-2.6 60.9 / 63.6 - 1.70 - 54.9 - -
LLaVA-OneVision-7B 58.2 / - 56.7 - - - - -
Qwen2-VL-7B 63.3 / 69.0 67.0 1.44 - 55.6 - -
Qwen2.5-VL-7B 65.1 / 71.6 69.6 1.79 70.2 45.3 - -
InternVL2-8B 56.3 / 59.3 65.8 1.57 64.0 54.6 - -
InternVL2.5-8B 64.2 / 66.9 72.0 1.68 68.9 60.0 - -
InternVL3-8B 66.3 / 68.9 75.4 1.69 71.4 58.8 38.6 / 55.2 61.4
InternVL3-9B 66.7 / 68.9 74.3 1.69 70.8 62.5 41.1 / 58.0 62.3
InternVL3-14B 70.4 / 73.0 76.6 1.73 73.3 63.9 44.1 / 60.6 64.9
InternVL2-26B 57.0 / 60.2 67.5 1.67 64.2 56.1 - -
InternVL2.5-26B 66.9 / 69.2 75.2 1.86 72.3 59.9 - -
Oryx-1.5-32B 67.3 / 74.9 70.1 1.52 72.3 - - -
Qwen2.5-VL-32B 70.5 / 77.9 - 1.93 - - - -
VILA-1.5-40B 60.1 / 61.1 - 1.61 56.7 - - -
InternVL2-40B 66.1 / 68.6 72.0 1.78 71.0 60.6 - -
InternVL2.5-38B 70.7 / 73.1 74.4 1.82 75.3 63.3 - -
InternVL3-38B 72.7 / 75.0 76.9 1.81 77.8 67.3 46.9 / 62.8 67.5
GPT-4V/4T 59.9 / 63.3 43.7 1.53 49.2 59.1 - -
GPT-4o-20240513 71.9 / 77.2 - 1.63 64.6 66.7 - -
GPT-4o-20240806 - - 1.87 - - 41.8 / 58.3 -
Gemini-1.5-Pro 75.0 / 81.3 - 1.30 - 64.0 40.1 / 56.4 -
VideoLLaMA2-72B 61.4 / 63.1 62.0 - - - - -
LLaVA-OneVision-72B 66.2 / 69.5 59.4 - 66.4 61.3 - -
Qwen2-VL-72B 71.2 / 77.8 73.6 1.70 - - 41.3 / 56.2 -
Qwen2.5-VL-72B 73.3 / 79.1 70.4 2.02 74.6 60.7 - -
InternVL2-Llama3-76B 64.7 / 67.8 69.6 1.71 69.9 61.1 - -
InternVL2.5-78B 72.1 / 74.0 76.4 1.97 75.7 63.6 42.2 / 58.5 66.0
InternVL3-78B 72.7 / 75.7 78.7 1.81 79.5 65.7 48.4 / 65.3 68.3

GUI Grounding

Benchmarks GPT-4o Gemini 2.0 Claude Aguvis-72B Qwen2.5-VL-72B UI-TARS-72B InternVL3-8B InternVL3-38B InternVL3-78B
ScreenSpot 18.1 84.0 83.0 89.2 87.1 88.4 79.5 85.6 88.7
ScreenSpot-V2 - - - - - 90.3 81.4 88.3 90.9

Spatial Reasoning

Model Name Obj. Count Abs. Dist. Obj. Size Room Size Rel. Dist. Rel. Dir. Route Plan Appr. Order Overall
GPT-4o 46.2 5.3 43.8 38.2 37.0 41.3 31.5 28.5 34.0
Gemini-1.5 Flash 49.8 30.8 53.5 54.4 37.7 41.0 31.5 37.8 42.1
Gemini-1.5 Pro 56.2 30.9 64.1 43.6 51.3 46.3 36.0 34.6 45.4
VILA-1.5-8B 17.4 21.8 50.3 18.8 32.1 34.8 31.0 24.8 28.9
LongVA-7B 38.0 16.6 38.9 22.2 33.1 43.3 25.4 15.7 29.2
LLaVA-NeXT-Video-7B 48.5 14.0 47.8 24.2 43.5 42.4 34.0 30.6 35.6
LLaVA-OneVision-7B 47.7 20.2 47.4 12.3 42.5 35.2 29.4 24.4 32.4
InternVL3-8B 68.1 39.0 48.4 33.6 48.3 36.4 27.3 35.4 42.1
InternVL3-38B 71.7 50.2 46.1 41.7 53.5 38.6 28.9 60.7 48.9
LLaVA-NeXT-Video-72B 48.9 22.8 57.4 35.3 42.4 36.7 35.0 48.6 40.9
LLaVA-OneVision-72B 43.5 23.9 57.6 37.5 42.5 39.9 32.5 44.6 40.2
InternVL3-78B 71.2 53.7 44.4 39.5 55.9 39.5 28.9 54.5 48.4

Citation


  @article{wang2024mpo,
    title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
    author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
    journal={arXiv preprint arXiv:2411.10442},
    year={2024}
  }

  @article{chen2024expanding,
    title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
    author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
    journal={arXiv preprint arXiv:2412.05271},
    year={2024}
  }

  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }

  @inproceedings{chen2024internvl,
    title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={24185--24198},
    year={2024}
  }

