

We introduce InternVL2, currently the most powerful open-source Multimodal Large Language Model (MLLM). The InternVL2 family includes models ranging from a 1B model, suitable for edge devices, to a 108B model, which is significantly more powerful. With larger-scale language models, InternVL2-Pro demonstrates outstanding multimodal understanding capabilities, matching the performance of commercial closed-source models across various benchmarks.

The InternVL2 family is built upon the following designs:

  1. Progressive alignment with larger language models: We introduce a progressive alignment training strategy, resulting in the first vision foundation model natively aligned with large language models. By employing a progressive training strategy in which the model scales from small to large while the data is refined from coarse to fine, we completed the training of large models at relatively low cost. This approach has demonstrated excellent performance with limited resources.
  2. Multimodal input: With one set of parameters, our model supports multiple modalities of input, including text, images, video, and medical data.
  3. Multitask output: Powered by our recent work VisionLLMv2, our model supports various output formats, such as images, bounding boxes, and masks, demonstrating extensive versatility. By connecting the MLLM with multiple downstream task decoders, InternVL2 can be generalized to hundreds of vision-language tasks while achieving performance comparable to expert models (a minimal sketch of this layout follows below).
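The multimodal-input and multitask-output designs above can be pictured as a ViT encoder whose tokens are projected by an MLP into the LLM's embedding space, with optional task-specific decoders reading the LLM's hidden states. The following PyTorch sketch illustrates that layout only; all class and attribute names are illustrative placeholders, not the actual InternVL2 or VisionLLMv2 implementation.

```python
import torch
import torch.nn as nn
from typing import Dict, Optional


class InternVLStyleModel(nn.Module):
    """Illustrative ViT -> MLP projector -> LLM layout with optional task decoders."""

    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int,
                 task_decoders: Optional[Dict[str, nn.Module]] = None):
        super().__init__()
        self.vit = vit                      # vision encoder, returns patch tokens [B, N, vit_dim]
        self.mlp = nn.Sequential(           # projector aligning vision tokens to the LLM space
            nn.LayerNorm(vit_dim),
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                      # language model, returns hidden states [B, T, llm_dim]
        # Optional downstream decoders (boxes, masks, images, ...) for multitask output.
        self.task_decoders = nn.ModuleDict(task_decoders or {})

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor,
                task: Optional[str] = None) -> torch.Tensor:
        vision_tokens = self.mlp(self.vit(pixel_values))         # project image tokens
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)  # prepend them to the text
        hidden = self.llm(inputs)
        if task is not None:
            return self.task_decoders[task](hidden)              # e.g. box or mask decoder
        return hidden
```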

Model Card

| Name | InternVL2-2B | InternVL2-4B | InternVL2-8B | InternVL2-26B | InternVL2-40B | InternVL2-108B |
| --- | --- | --- | --- | --- | --- | --- |
| Model Size (Total) | 2.21B | 4.15B | 8.08B | 25.51B | 40.07B | 108.70B |
| Model Size (ViT) | 304.01M | 304.01M | 304.01M | 5.54B | 5.54B | 5.54B |
| Model Size (MLP) | 12.60M | 22.03M | 33.57M | 116.43M | 143.17M | 172.01M |
| Model Size (LLM) | 2.21B | 3.82B | 7.74B | 19.86B | 34.39B | 102.99B |

Resolution (all models): dynamic resolution, up to 12 tiles of 448 × 448 during training and up to 40 tiles (4K resolution) during testing.
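The dynamic-resolution setting above can be realized by choosing a tile grid that best matches each image's aspect ratio, then cutting the resized image into 448 × 448 tiles. The sketch below mirrors that idea under simple assumptions; it is a simplified illustration, not InternVL2's exact preprocessing code.

```python
from PIL import Image


def dynamic_tile(image: Image.Image, tile_size: int = 448, max_tiles: int = 12):
    w, h = image.size
    aspect = w / h
    # Enumerate all (cols, rows) grids with at most `max_tiles` tiles and pick
    # the one whose aspect ratio is closest to the input image.
    grids = [(c, r) for c in range(1, max_tiles + 1)
             for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_size, r * tile_size,
                   (c + 1) * tile_size, (r + 1) * tile_size)
            tiles.append(resized.crop(box))
    return tiles  # each tile is a 448x448 image fed to the vision encoder


# Example: up to 12 tiles during training, up to 40 tiles (4K input) at test time.
# tiles = dynamic_tile(Image.open("example.jpg"), max_tiles=40)
```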
Stage-1 Training Data: We extend the pre-training dataset used in InternVL 1.5 with data collected from diverse sources. These datasets span multiple tasks, including captioning, visual question answering, detection, grounding, and OCR. The OCR datasets were constructed by running PaddleOCR on Chinese images from Wukong and English images from LaionCOCO, and were manually verified. In addition, we crawled and manually parsed exam data from uworld, kaptest, testbank, aga, and sat, and also utilized the interleaved data from OmniCorpus.
Stage-1 Trainable Module: MLP

Stage-2 Training Data: We constructed the training data based on the 5M high-quality bilingual dataset used in InternVL 1.5. Specifically, we included video data such as EgoTaskQA, Mementos, STAR, NTU RGB+D, VideoChat2IT, and LSMDC-QA, as well as medical data such as Medical-Diff-VQA, Pathology-VQA, PMC-CaseReport, PMC-VQA, Slake, and VQA-RAD. We also included SROIE, FUNSD, and POIE to further enhance the model's ability to recognize handwritten text. Additionally, we excluded all data from ShareGPT-4V and replaced it with data from ShareGPT-4o.
Stage-2 Trainable Module: ViT + MLP + LLM
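The two trainable-module settings above translate directly into which parameter groups receive gradients in each stage: only the MLP projector in Stage-1, and ViT + MLP + LLM jointly in Stage-2. Below is a minimal sketch of that freeze/unfreeze logic; the `model` attributes and learning rates are placeholders for illustration, not the actual training configuration.

```python
import torch


def set_trainable(model, train_vit: bool, train_mlp: bool, train_llm: bool):
    # Assumes a model with .vit / .mlp / .llm submodules, as in the sketch above.
    for p in model.vit.parameters():
        p.requires_grad = train_vit
    for p in model.mlp.parameters():
        p.requires_grad = train_mlp
    for p in model.llm.parameters():
        p.requires_grad = train_llm


def make_optimizer(model, lr: float):
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)


# Stage 1: align vision features to the LLM with the projector only.
# set_trainable(model, train_vit=False, train_mlp=True, train_llm=False)
# optimizer = make_optimizer(model, lr=1e-4)   # placeholder learning rate

# Stage 2: full fine-tuning on the bilingual instruction data.
# set_trainable(model, train_vit=True, train_mlp=True, train_llm=True)
# optimizer = make_optimizer(model, lr=2e-5)   # placeholder learning rate
```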

Performance

InternVL2 demonstrates powerful capabilities in handling complex multimodal data, excelling in tasks involving mathematics, scientific charts, general charts, documents, infographics, and OCR. For instance, InternVL2-Pro achieves an accuracy of 66.3% on the MathVista benchmark, surpassing most closed-source commercial models and open-source models. Moreover, InternVL2 achieves state-of-the-art performance across a wide range of benchmarks, including the general chart benchmark ChartQA, the document benchmark DocVQA, the infographic benchmark InfographicVQA, and the general visual question answering benchmark MMBench.

Notably, there are two evaluation settings for the AI2D benchmark. In the first setting, we replace the content within the rectangles in the images with the letters of the options; in the second, we replace it with both the letters and the values of the options. Our model scores 87.3 in the first setting and 96.0 in the second.
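To make the two AI2D settings concrete, the sketch below overwrites each annotated rectangle either with its option letter alone (setting 1) or with the letter plus the option value (setting 2). The annotation format used here (a list of dicts with `box`, `letter`, and `value` keys) is an assumption for illustration only, not the actual evaluation code.

```python
from PIL import Image, ImageDraw


def mask_ai2d_boxes(image: Image.Image, annotations, with_value: bool = False):
    img = image.copy()
    draw = ImageDraw.Draw(img)
    for ann in annotations:
        x0, y0, x1, y1 = ann["box"]
        # Blank out the original content of the rectangle.
        draw.rectangle((x0, y0, x1, y1), fill="white", outline="black")
        # Setting 1: letter only; Setting 2: letter plus option value.
        text = ann["letter"] if not with_value else f'{ann["letter"]}. {ann["value"]}'
        draw.text((x0 + 2, y0 + 2), text, fill="black")
    return img


# masked_v1 = mask_ai2d_boxes(img, anns, with_value=False)  # first setting
# masked_v2 = mask_ai2d_boxes(img, anns, with_value=True)   # second setting
```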

* Proprietary Model

| name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) | OpenCompass (avg score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V* (20240409) | 63.1 / 61.7 | 58.1 | 89.4 | 78.1 | 87.2 | - | 678 | 81.0 | 80.2 | 63.5 |
| Gemini Pro 1.5* | 58.5 / 60.6 | 57.7 | 80.3 | 81.3 | 86.5 | 72.7 | 754 | 73.9 | 73.8 | 64.4 |
| Claude3.5-Sonnet* | 68.3 / 65.9 | 67.7 | 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 | 67.9 |
| GPT-4o* (20240513) | 69.1 / 69.2 | 63.8 | 94.2 | 85.7 | 92.8 | - | 736 | 83.4 | 82.1 | 69.9 |
| Cambrian-1 | 49.7 / 50.4 | 53.2 | 79.7 | 75.6 | 75.5 | - | 600 | 81.4 | - | 58.3 |
| LLaVA-NeXT Qwen1.5 | 50.1 | 49.0 | 80.4 | 79.7 | 85.7 | - | - | 80.5 | - | - |
| InternVL2-Pro | 58.9 / 62.0 | 66.3 | 87.3 / 96.0 | 87.1 | 95.1 | 83.3 | 837 | 87.8 | 87.2 | 71.8 |
| name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) | OpenCompass (avg score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2-1B | 35.4 / 36.7 | 37.7 | 64.1 | 72.9 | 81.7 | 50.9 | 754 | 65.4 | 60.7 | 48.3 |
| InternVL2-2B | 34.3 / 36.3 | 46.3 | 74.1 | 76.2 | 86.9 | 58.9 | 784 | 73.2 | 70.9 | 54.0 |
| InternVL2-4B | 47.0 / 48.3 | 58.6 | 78.9 | 81.5 | 89.2 | 67.0 | 788 | 78.6 | 73.9 | 60.6 |
| InternVL2-8B | 49.3 / 51.2 | 58.3 | 83.8 | 83.3 | 91.6 | 74.8 | 794 | 81.7 | 81.2 | 64.1 |
| InternVL2-26B | 48.3 / 50.7 | 59.4 | 84.5 | 84.9 | 92.9 | 75.9 | 825 | 83.4 | 82.0 | 66.4 |
| InternVL2-40B | 53.9 / 55.2 | 63.7 | 87.1 | 86.2 | 93.9 | 78.7 | 837 | 86.8 | 86.5 | 69.7 |
| InternVL2-Llama3-76B | 55.2 / 58.2 | 65.5 | 87.6 | 88.4 | 94.1 | 82.0 | 839 | 86.5 | 86.3 | 71.0 |
| InternVL2-Pro | 58.9 / 62.0 | 66.3 | 87.3 / 96.0 | 87.1 | 95.1 | 83.3 | 837 | 87.8 | 87.2 | 71.8 |
  • We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for AI2D, ChartQA, DocVQA, InfoVQA, and MMBench were obtained using the InternVL repository, while MathVista and OCRBench were evaluated using VLMEvalKit.
  • For MMMU, we report both the original scores (left: evaluated with the InternVL codebase for InternVL-series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right: collected from the OpenCompass leaderboard).
  • Please note that evaluating the same model with different testing toolkits, such as InternVL and VLMEvalKit, can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results. For a quick local sanity check of a released checkpoint, see the sketch below.
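Outside the evaluation toolkits above, the released InternVL2 checkpoints on HuggingFace can be loaded with `trust_remote_code` for a quick local check. The snippet below is a minimal sketch; refer to the corresponding model repository for the exact preprocessing and `chat` arguments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-8B"  # any released size follows the same pattern
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# `pixel_values` is the stack of 448x448 tiles produced by the dynamic-resolution
# preprocessing sketched earlier (shape [num_tiles, 3, 448, 448], bfloat16, on GPU).
# question = "<image>\nDescribe this chart."
# response = model.chat(tokenizer, pixel_values, question,
#                       generation_config=dict(max_new_tokens=512))
# print(response)
```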

In addition to the VQA benchmarks mentioned above, we also evaluate InternVL2-Pro on the MM-NIAH benchmark, a comprehensive benchmark designed for long multimodal document understanding. As shown in the figure below, our model with Retrieval-Augmented Generation (RAG) achieves performance comparable to Gemini in comprehending long multimodal documents. Improving performance on the counting task and other tasks involving image needles is left for future work. See this paper for more details on InternVL2-Pro augmented with RAG.

[Figure: MM-NIAH results of InternVL2-Pro with RAG compared with Gemini.]
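As a rough illustration of the retrieval-augmented setup mentioned above, one can embed the question and every chunk of a long multimodal document, keep only the top-k most similar chunks, and feed those to the MLLM instead of the full document. The chunking, embedding, and answering helpers referenced in the comments are assumptions for illustration, not the method described in the linked paper.

```python
import numpy as np


def retrieve_top_k(question_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 8):
    # Cosine similarity between the question and every chunk embedding.
    q = question_emb / np.linalg.norm(question_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k most relevant chunks


# chunks = split_document(long_doc)                        # text/image segments (assumed helper)
# idx = retrieve_top_k(embed(question), embed_all(chunks)) # assumed embedding helpers
# answer = mllm_answer(question, [chunks[i] for i in idx]) # assumed MLLM wrapper
```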

Examples

Citation


  @article{chen2023internvl,
    title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
    journal={arXiv preprint arXiv:2312.14238},
    year={2023}
  }
  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }
  
