
Performance

Mini-InternVL 2.0 demonstrates strong performance across the majority of benchmarks. Our smallest model contains only 1 billion parameters, yet it performs comparably to models with around 2 billion parameters, such as DeepSeek-VL-1.3B and MiniCPM-V 2.0. Compared to other lightweight models, Mini-InternVL2-4B excels on most benchmarks, particularly MMBench, ChartQA, DocVQA, and MathVista, where its performance is on par with commercial models such as Gemini Pro 1.5. Notably, compared to InternVL2-Llama3-76B, which uses the larger InternViT-6B, Mini-InternVL 2.0 achieves approximately 90% of its performance while using only 5% of the parameters. This highlights the effectiveness of our knowledge distillation strategy.

| Name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) | OpenCompass (avg score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V* (20240409) | 61.7 | 58.1 | 89.4 | 78.1 | 87.2 | - | 678 | 81.0 | 80.2 | 63.5 |
| Gemini Pro 1.5* | 60.6 | 57.7 | 80.3 | 81.3 | 86.5 | 72.7 | 754 | 73.9 | 73.8 | 64.4 |
| Claude3.5-Sonnet* | 65.9 | 67.7 | 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 | 67.9 |
| GPT-4o* (20240513) | 69.2 | 63.8 | 94.2 | 85.7 | 92.8 | - | 736 | 83.4 | 82.1 | 69.9 |
| Cambrian-1 | 50.4 | 53.2 | 79.7 | 75.6 | 75.5 | - | 600 | 81.4 | - | 58.3 |
| LLaVA-NeXT Qwen1.5 | 50.1 | 49.0 | 80.4 | 79.7 | 85.7 | - | - | 80.5 | - | - |
| DeepSeek-VL-1.3B | 33.8 | 29.8 | 51.5 | - | - | - | 413 | 64.6 | 62.9 | 39.6 |
| MiniCPM-V 2.0 | 38.2 | 38.7 | 62.9 | - | 71.9 | - | 605 | 69.1 | 66.5 | 47.9 |
| Qwen2-VL-2B | 42.2 | 43.0 | 74.7 | 73.5 | 90.1 | 65.5 | 794 | 74.9 | 73.5 | 57.2 |
| InternVL2-Llama3-76B | 58.2 | 65.5 | 87.6 | 88.4 | 94.1 | 82.0 | 839 | 86.5 | 86.3 | 71.0 |
| Mini-InternVL2-1B | 36.7 | 37.7 | 64.1 | 72.9 | 81.7 | 50.9 | 754 | 65.4 | 60.7 | 48.3 |
| Mini-InternVL2-2B | 36.3 | 46.3 | 74.1 | 76.2 | 86.9 | 58.9 | 784 | 73.2 | 70.9 | 54.0 |
| Mini-InternVL2-4B | 48.3 | 58.6 | 78.9 | 81.5 | 89.2 | 67.0 | 788 | 78.6 | 73.9 | 60.6 |
  • We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results for AI2D, ChartQA, DocVQA, InfoVQA, and MMBench were obtained with the InternVL repository, while MMMU, MathVista, and OCRBench were evaluated with VLMEvalKit.
  • Please note that evaluating the same model with different testing toolkits, such as InternVL and VLMEvalKit, can produce slight differences in results, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies.
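For reference, the sketch below shows how these checkpoints are typically queried through the standard `transformers` interface of the InternVL2 series. The repository id and image path are illustrative assumptions, and the preprocessing is deliberately simplified to a single tile; the official model cards and toolkits use a dynamic-tiling helper instead (see the Resolution row in the Model Card below).

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-4B"  # assumed checkpoint id; substitute the model you evaluate

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Simplified preprocessing: one 448x448 tile with ImageNet normalization.
# The released evaluation code tiles the image dynamically instead.
transform = T.Compose([
    T.Lambda(lambda im: im.convert("RGB")),
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg")).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nDescribe the chart in this image in detail."
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```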

Model Card

| Name | Mini-InternVL2-1B | Mini-InternVL2-2B | Mini-InternVL2-4B |
| --- | --- | --- | --- |
| Model Size (Total) | 0.94B | 2.21B | 4.15B |
| Model Size (ViT) | 304.01M | 304.01M | 304.01M |
| Model Size (MLP) | 4.48M | 12.60M | 22.03M |
| Model Size (LLM) | 0.63B | 1.89B | 3.82B |
| Resolution | dynamic resolution, up to 12 tiles of 448 × 448 in training, up to 40 tiles in testing (4K resolution) | | |
| Stage-1 Training Data | We extend the pre-training dataset used in InternVL 1.5 with data collected from diverse sources. These datasets span multiple tasks, including captioning, visual question answering, detection, grounding, and OCR. The OCR datasets were constructed by running PaddleOCR on Chinese images from Wukong and on English images from LaionCOCO, and were manually verified. In addition, we crawled and manually parsed exam data from uworld, kaptest, testbank, aga, and sat. Interleaved data from OmniCorpus was also used. | | |
| Stage-1 Trainable Module | ViT + MLP | MLP | MLP |
| Stage-2 Training Data | We constructed the training data based on the 5M high-quality bilingual dataset used in InternVL 1.5. Specifically, we included video data such as EgoTaskQA, Mementos, STAR, NTU RGB+D, VideoChat2IT, and LSMDC-QA, as well as medical data such as Medical-Diff-VQA, Pathology-VQA, PMC-CaseReport, PMC-VQA, Slake, and VQA-RAD. We also included SROIE, FUNSD, and POIE to further enhance the model's ability to recognize handwritten fonts. Additionally, we excluded all data from ShareGPT-4V and replaced it with data from ShareGPT-4o. | | |
| Stage-2 Trainable Module | ViT + MLP + LLM | ViT + MLP + LLM | ViT + MLP + LLM |
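To make the dynamic-resolution setting concrete, the following is a rough sketch of aspect-ratio-based tiling: the image is matched to the tile grid whose aspect ratio is closest to its own within the tile budget, resized to fill that grid, and cut into 448 × 448 crops. This is a simplified illustration, not the repository's exact preprocessing (which, for example, can also append a thumbnail tile).

```python
from PIL import Image

def choose_grid(width: int, height: int, max_tiles: int = 12) -> tuple[int, int]:
    """Pick a (cols, rows) grid whose aspect ratio best matches the image,
    with cols * rows <= max_tiles; ties prefer more tiles."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(aspect - cols / rows)
            if diff < best_diff or (diff == best_diff and cols * rows > best[0] * best[1]):
                best, best_diff = (cols, rows), diff
    return best

def tile_image(img: Image.Image, tile: int = 448, max_tiles: int = 12) -> list[Image.Image]:
    """Resize the image to fill the chosen grid, then split it into tile x tile crops."""
    cols, rows = choose_grid(*img.size, max_tiles=max_tiles)
    img = img.resize((cols * tile, rows * tile))
    return [
        img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]

# max_tiles=12 mirrors the training budget; testing raises it to 40 tiles (~4K input).
tiles = tile_image(Image.new("RGB", (1920, 1080)), max_tiles=12)
print(len(tiles))  # 8 tiles for this 16:9 example
```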

Domain Adaptation

To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to downstream tasks, including autonomous driving, medical imaging, and remote sensing, and to outperform specialized models on them. We hope this provides insights into the application of MLLMs.

In the documentation, we provide detailed information on the datasets and the fine-tuning process.
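As a rough illustration of what the fine-tuning data looks like, the sketch below builds one record in the LLaVA-style conversation format consumed by the InternVL fine-tuning pipeline. The file paths, question, and answer are placeholders rather than real dataset entries; the documentation remains the authoritative reference for the exact schema, including the meta file that registers each dataset.

```python
import json

# One illustrative supervised fine-tuning record (placeholder content):
# an image path plus a human/gpt conversation, with "<image>" marking where
# the image is injected into the prompt.
record = {
    "id": 0,
    "image": "images/rsvqa/example_0001.png",
    "conversations": [
        {"from": "human", "value": "<image>\nIs there a residential building in the image?"},
        {"from": "gpt", "value": "Yes"},
    ],
}

# Fine-tuning annotations are stored one JSON object per line (JSONL).
with open("rsvqa_lr_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```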


The results on remote sensing VQA and visual grounding tasks. For the VQA datasets, we omit area and count questions during evaluation. "DA" denotes the model after domain adaptation on remote sensing data.

| Method | Size | RSVQA-LR Rural/Urban | RSVQA-LR Presence | RSVQA-LR Compare | RSVQA-LR Avg. | RSVQA-HR-Test1 Presence | RSVQA-HR-Test1 Compare | RSVQA-HR-Test1 Avg. | RSVQA-HR-Test2 Presence | RSVQA-HR-Test2 Compare | RSVQA-HR-Test2 Avg. | DIOR-RSVG (acc@0.5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RSVQA | - | 90.00 | 87.46 | 81.50 | 86.32 | 90.43 | 88.19 | 83.12 | 86.26 | 85.94 | 77.50 | - |
| Bi-Modal | - | 92.66 | 91.06 | 92.66 | 91.63 | 92.03 | 91.83 | 84.98 | 89.37 | 89.62 | 80.54 | - |
| EasyToHard | - | 91.67 | 90.66 | 87.49 | 89.94 | 91.39 | 89.75 | 93.97 | 87.97 | 87.68 | 79.06 | - |
| GeoChat | 7B | 94.00 | 91.09 | 90.33 | 90.70 | - | - | - | - | 58.45 | 83.19 | 72.30 |
| SkyEyeGPT | - | 75.00 | 88.93 | 88.63 | 84.19 | 84.95 | 85.63 | 85.29 | 83.50 | 80.28 | 81.89 | 88.59 |
| SkySenseGPT | - | 95.00 | 91.07 | 92.00 | 92.69 | - | - | - | 69.14 | 84.14 | 76.64 | - |
| Mini-InternVL-4B | 4B | 66.00 | 64.64 | 73.26 | 69.55 | 62.42 | 79.20 | 71.72 | 65.73 | 79.70 | 73.55 | 16.89 |
| InternVL2-Llama3-76B | 76B | 61.00 | 66.29 | 79.61 | 73.77 | 60.79 | 77.47 | 70.04 | 63.30 | 78.32 | 71.71 | 29.65 |
| Mini-InternVL-DA-1B | 1B | 95.00 | 81.39 | 91.6 | 87.37 | 92.24 | 91.76 | 91.98 | 89.38 | 90.91 | 90.24 | 89.73 |
| Mini-InternVL-DA-2B | 2B | 93.00 | 87.07 | 91.85 | 89.87 | 92.33 | 92.21 | 92.27 | 89.60 | 90.86 | 90.30 | 89.24 |
| Mini-InternVL-DA-4B | 4B | 92.00 | 85.69 | 92.18 | 89.46 | 92.42 | 92.12 | 92.25 | 89.25 | 90.92 | 90.18 | 92.04 |
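The DIOR-RSVG column reports acc@0.5, i.e. the fraction of predicted boxes whose IoU with the ground-truth box is at least 0.5. Below is a minimal sketch of that metric under the standard definition; it is not the repository's evaluation script.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def acc_at_05(predictions, ground_truths):
    """Fraction of predicted boxes with IoU >= 0.5 against their ground truth."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: one hit (IoU >= 0.5) out of two samples -> acc@0.5 = 0.5
print(acc_at_05([(10, 10, 50, 50), (0, 0, 20, 20)],
                [(12, 12, 52, 52), (30, 30, 60, 60)]))
```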

The results of our models on GMAI-MMBench. The results of other models are taken from the GMAI-MMBench leaderboard. "DA" denotes the model after domain adaptation on medical data. After supervised fine-tuning, our models show significant improvements and outperform several medical-specialized models (e.g., LLaVA-Med, RadFM) and some commercial closed-source models (e.g., Claude3-Opus) on most metrics.

| Model | Size | Seg C | Seg M | 2D Cls | 2D Det | 2D Mcls_acc | 2D Mcls_recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-VL-Chat | 9.6B | 34.45 | 35.20 | 39.55 | 22.04 | 42.88 | 81.23 |
| LLaVA-NeXT-mistral-7B | 7.6B | 36.29 | 35.20 | 39.34 | 27.87 | 44.05 | 47.70 |
| LLaVA-Med | - | 18.45 | 18.97 | 21.15 | 17.14 | 45.84 | 41.19 |
| RadFM | 14B | 20.43 | 20.27 | 25.71 | 18.83 | 40.98 | 57.45 |
| Claude3-Opus | - | 33.56 | 33.36 | 32.17 | 24.72 | 45.31 | 38.98 |
| GPT-4V | - | 47.87 | 46.58 | 42.24 | 30.32 | 45.21 | 40.59 |
| Mini-InternVL-1B | 1B | 34.30 | 34.55 | 36.02 | 24.08 | 21.67 | 8.57 |
| Mini-InternVL-2B | 2B | 35.33 | 35.61 | 38.08 | 25.31 | 43.52 | 16.13 |
| Mini-InternVL-4B | 4B | 36.60 | 36.99 | 38.74 | 26.01 | 43.99 | 16.25 |
| Mini-InternVL-DA-1B | 1B | 38.67 | 39.44 | 35.87 | 23.09 | 22.79 | 8.99 |
| Mini-InternVL-DA-2B | 2B | 40.22 | 39.46 | 39.34 | 25.59 | 44.33 | 16.20 |
| Mini-InternVL-DA-4B | 4B | 41.41 | 40.45 | 41.34 | 24.84 | 44.33 | 16.59 |

The results on the action and control-signal prediction tasks of the BDD-X dataset. For the action tasks, we follow DriveGPT4 and report results on action description, action justification, and full-text generation (i.e., combining description and justification). "B4" stands for BLEU-4. RMSE denotes the root mean squared error, and Aτ measures the proportion of test samples with prediction errors less than τ. "DA" denotes the model after domain adaptation on BDD-X.

| Method | Description CIDEr | Description B4 | Description ROUGE | Justification CIDEr | Justification B4 | Justification ROUGE | Full CIDEr | Full B4 | Full ROUGE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ADAPT | 219.35 | 33.42 | 61.83 | 94.62 | 9.95 | 32.01 | 93.66 | 17.76 | 44.32 |
| DriveGPT4 | 254.62 | 35.99 | 63.97 | 101.55 | 10.84 | 31.91 | 102.71 | 19.00 | 45.10 |
| Mini-InternVL-DA-1B | 223.85 | 34.17 | 62.11 | 95.52 | 9.70 | 32.58 | 83.72 | 16.78 | 44.29 |
| Mini-InternVL-DA-2B | 242.14 | 35.77 | 63.03 | 105.06 | 10.63 | 32.46 | 98.47 | 18.05 | 44.52 |
| Mini-InternVL-DA-4B | 237.41 | 35.94 | 63.67 | 104.62 | 9.51 | 32.23 | 97.42 | 17.70 | 44.98 |
| Method | Speed RMSE (m/s) ↓ | Speed A0.1 | Speed A0.5 | Speed A1.0 | Speed A5.0 | Turning Angle RMSE (degree) ↓ | Angle A0.1 | Angle A0.5 | Angle A1.0 | Angle A5.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ADAPT | 3.02 | 9.56 | 24.77 | 37.07 | 90.39 | 11.98 | 27.93 | 66.83 | 75.13 | 89.45 |
| DriveGPT4 | 1.30 | 30.09 | 60.88 | 79.92 | 98.44 | 8.98 | 59.23 | 72.89 | 79.59 | 95.32 |
| Mini-InternVL-DA-1B | 1.28 | 29.44 | 60.38 | 79.34 | 98.67 | 9.45 | 59.34 | 73.54 | 80.28 | 92.76 |
| Mini-InternVL-DA-2B | 1.26 | 27.96 | 59.23 | 80.06 | 98.78 | 9.52 | 57.40 | 72.54 | 80.06 | 92.04 |
| Mini-InternVL-DA-4B | 1.31 | 28.84 | 60.94 | 78.78 | 98.61 | 9.46 | 59.12 | 73.15 | 80.17 | 92.65 |
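For clarity, here is a minimal sketch of the control-signal metrics in the table above, assuming the usual definitions: RMSE over the predicted signal, and Aτ as the percentage of samples whose absolute error is below τ. The official evaluation script may differ in details such as rounding or unit handling.

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error between predicted and ground-truth signals."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def a_tau(pred, target, tau):
    """Percentage of test samples whose absolute prediction error is below tau."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target) < tau) * 100.0)

# Toy example with speed predictions in m/s.
pred = [3.1, 5.4, 0.2, 7.9]
gt = [3.0, 5.0, 1.0, 8.0]
print(rmse(pred, gt))            # ~0.45
print(a_tau(pred, gt, tau=0.5))  # 75.0: three of four errors are below 0.5 m/s
```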

The results on the official driving-with-language leaderboard. "DA" denotes the model after domain adaptation on DriveLM. The other results in the table are taken from the 2024 Autonomous Driving Challenge leaderboard. MTMM†, MMFM_AD, and Team NVIDIA are team names on the challenge leaderboard, which we use to refer to their methods.

| Method | #Param | Accuracy | ChatGPT | Bleu 1 | Bleu 2 | Bleu 3 | Bleu 4 | ROUGE L | CIDEr | Match | Final Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL4Drive-v2 | 26B | 0.7339 | 65.25 | 0.7787 | 0.7176 | 0.6608 | 0.6059 | 0.7449 | 0.2061 | 47.65 | 0.6002 |
| MTMM† | - | 0.7473 | 65.59 | 0.76 | 0.70 | 0.64 | 0.59 | 0.74 | 0.18 | 0.45 | 0.5974 |
| Team NVIDIA | - | 0.7746 | 59.89 | - | - | - | - | - | - | - | 0.5884 |
| MMFM_AD | - | 0.6658 | 63.92 | - | - | - | - | - | - | - | 0.5732 |
| Mini-InternVL-4B | 4B | 0.0 | 54.45 | 0.2405 | 0.0801 | 0.0252 | 0.0084 | 0.1927 | 0.0018 | 34.30 | 0.3051 |
| InternVL2-Llama3-76B | 76B | 0.0 | 52.50 | 0.2100 | 0.0884 | 0.0249 | 0.0078 | 0.1848 | 0.0001 | 34.22 | 0.2963 |
| Mini-InternVL-DA-1B | 1B | 0.7007 | 63.84 | 0.7362 | 0.6767 | 0.6214 | 0.5678 | 0.7365 | 0.1669 | 39.76 | 0.5686 |
| Mini-InternVL-DA-2B | 2B | 0.7628 | 65.23 | 0.7616 | 0.7012 | 0.6452 | 0.5908 | 0.7447 | 0.1914 | 43.24 | 0.5958 |
| Mini-InternVL-DA-4B | 4B | 0.7296 | 63.97 | 0.7642 | 0.7032 | 0.6463 | 0.5914 | 0.7427 | 0.1976 | 42.16 | 0.5821 |

Citation


  @article{gao2024mini,
    title={Mini-InternVL: a flexible-transfer pocket multi-modal model with 5\% parameters and 90\% performance},
    author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
    journal={Visual Intelligence},
    volume={2},
    number={1},
    pages={1--17},
    year={2024},
    publisher={Springer}
  }

  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }

  @inproceedings{chen2024internvl,
    title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={24185--24198},
    year={2024}
  }

