
Performance

Mini-InternVL 2.0 demonstrates strong performance across the majority of benchmarks. Our smallest model contains only 1 billion parameters, yet it performs comparably to models with around 2 billion parameters, such as DeepSeek-VL-1.3B and MiniCPM-V 2.0. Compared to other lightweight models, Mini-InternVL2-4B excels on most benchmarks, particularly MMBench, ChartQA, DocVQA, and MathVista, where its performance is on par with commercial models such as Gemini Pro 1.5. Notably, compared to InternVL2-Llama3-76B, which uses the larger InternViT-6B, Mini-InternVL 2.0 achieves approximately 90% of its performance while using only 5% of the parameters. This highlights the effectiveness of our knowledge distillation strategy.

| Name | MMMU (val) | MathVista (testmini) | AI2D (test) | ChartQA (test) | DocVQA (test) | InfoVQA (test) | OCRBench | MMB-EN (test) | MMB-CN (test) | OpenCompass (avg score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V* (20240409) | 61.7 | 58.1 | 89.4 | 78.1 | 87.2 | - | 678 | 81.0 | 80.2 | 63.5 |
| Gemini Pro 1.5* | 60.6 | 57.7 | 80.3 | 81.3 | 86.5 | 72.7 | 754 | 73.9 | 73.8 | 64.4 |
| Claude3.5-Sonnet* | 65.9 | 67.7 | 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 | 67.9 |
| GPT-4o* (20240513) | 69.2 | 63.8 | 94.2 | 85.7 | 92.8 | - | 736 | 83.4 | 82.1 | 69.9 |
| Cambrian-1 | 50.4 | 53.2 | 79.7 | 75.6 | 75.5 | - | 600 | 81.4 | - | 58.3 |
| LLaVA-NeXT Qwen1.5 | 50.1 | 49.0 | 80.4 | 79.7 | 85.7 | - | - | 80.5 | - | - |
| DeepSeek-VL-1.3B | 33.8 | 29.8 | 51.5 | - | - | - | 413 | 64.6 | 62.9 | 39.6 |
| MiniCPM-V 2.0 | 38.2 | 38.7 | 62.9 | - | 71.9 | - | 605 | 69.1 | 66.5 | 47.9 |
| Qwen2-VL-2B | 42.2 | 43.0 | 74.7 | 73.5 | 90.1 | 65.5 | 794 | 74.9 | 73.5 | 57.2 |
| InternVL2-Llama3-76B | 58.2 | 65.5 | 87.6 | 88.4 | 94.1 | 82.0 | 839 | 86.5 | 86.3 | 71.0 |
| Mini-InternVL2-1B | 36.7 | 37.7 | 64.1 | 72.9 | 81.7 | 50.9 | 754 | 65.4 | 60.7 | 48.3 |
| Mini-InternVL2-2B | 36.3 | 46.3 | 74.1 | 76.2 | 86.9 | 58.9 | 784 | 73.2 | 70.9 | 54.0 |
| Mini-InternVL2-4B | 48.3 | 58.6 | 78.9 | 81.5 | 89.2 | 67.0 | 788 | 78.6 | 73.9 | 60.6 |
  • We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results for AI2D, ChartQA, DocVQA, InfoVQA, and MMBench were obtained with the InternVL repository, while MMMU, MathVista, and OCRBench were evaluated with VLMEvalKit.
  • Please note that evaluating the same model with different testing toolkits, such as InternVL and VLMEvalKit, can produce slight differences in results, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies.
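For reference, the sketch below shows how these checkpoints are typically queried through the standard `transformers` interface of the InternVL2 series. The repository id and image path are illustrative assumptions, and the preprocessing is deliberately simplified to a single tile; the official model cards and toolkits use a dynamic-tiling helper instead (see the Resolution row in the Model Card below).

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-4B"  # assumed checkpoint id; substitute the model you evaluate

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Simplified preprocessing: one 448x448 tile with ImageNet normalization.
# The released evaluation code tiles the image dynamically instead.
transform = T.Compose([
    T.Lambda(lambda im: im.convert("RGB")),
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg")).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nDescribe the chart in this image in detail."
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```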

Model Card

| Name | Mini-InternVL2-1B | Mini-InternVL2-2B | Mini-InternVL2-4B |
| --- | --- | --- | --- |
| Model Size (Total) | 0.94B | 2.21B | 4.15B |
| Model Size (ViT) | 304.01M | 304.01M | 304.01M |
| Model Size (MLP) | 4.48M | 12.60M | 22.03M |
| Model Size (LLM) | 0.63B | 1.89B | 3.82B |
| Resolution | dynamic resolution, up to 12 tiles of 448 × 448 in training, up to 40 tiles in testing (4K resolution) | | |
| Stage-1 Training Data | We extend the pre-training dataset used in InternVL 1.5 with data collected from diverse sources. These datasets span multiple tasks, including captioning, visual question answering, detection, grounding, and OCR. The OCR datasets were constructed by running PaddleOCR on Chinese images from Wukong and on English images from LaionCOCO, and were manually verified. In addition, we crawled and manually parsed exam data from uworld, kaptest, testbank, aga, and sat. Interleaved data from OmniCorpus was also used. | | |
| Stage-1 Trainable Module | ViT + MLP | MLP | MLP |
| Stage-2 Training Data | We constructed the training data based on the 5M high-quality bilingual dataset used in InternVL 1.5. Specifically, we included video data such as EgoTaskQA, Mementos, STAR, NTU RGB+D, VideoChat2IT, and LSMDC-QA, as well as medical data such as Medical-Diff-VQA, Pathology-VQA, PMC-CaseReport, PMC-VQA, Slake, and VQA-RAD. We also included SROIE, FUNSD, and POIE to further enhance the model's ability to recognize handwritten fonts. Additionally, we excluded all data from ShareGPT-4V and replaced it with data from ShareGPT-4o. | | |
| Stage-2 Trainable Module | ViT + MLP + LLM | ViT + MLP + LLM | ViT + MLP + LLM |
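To make the dynamic-resolution setting concrete, the following is a rough sketch of aspect-ratio-based tiling: the image is matched to the tile grid whose aspect ratio is closest to its own within the tile budget, resized to fill that grid, and cut into 448 × 448 crops. This is a simplified illustration, not the repository's exact preprocessing (which, for example, can also append a thumbnail tile).

```python
from PIL import Image

def choose_grid(width: int, height: int, max_tiles: int = 12) -> tuple[int, int]:
    """Pick a (cols, rows) grid whose aspect ratio best matches the image,
    with cols * rows <= max_tiles; ties prefer more tiles."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(aspect - cols / rows)
            if diff < best_diff or (diff == best_diff and cols * rows > best[0] * best[1]):
                best, best_diff = (cols, rows), diff
    return best

def tile_image(img: Image.Image, tile: int = 448, max_tiles: int = 12) -> list[Image.Image]:
    """Resize the image to fill the chosen grid, then split it into tile x tile crops."""
    cols, rows = choose_grid(*img.size, max_tiles=max_tiles)
    img = img.resize((cols * tile, rows * tile))
    return [
        img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]

# max_tiles=12 mirrors the training budget; testing raises it to 40 tiles (~4K input).
tiles = tile_image(Image.new("RGB", (1920, 1080)), max_tiles=12)
print(len(tiles))  # 8 tiles for this 16:9 example
```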

Domain Adaptation

To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to downstream tasks, including autonomous driving, medical imaging, and remote sensing, and to outperform specialized models on them. We hope this provides insights into the application of MLLMs.

In the documentation, we provide detailed information on the datasets and the fine-tuning process.
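As a rough illustration of what the fine-tuning data looks like, the sketch below builds one record in the LLaVA-style conversation format consumed by the InternVL fine-tuning pipeline. The file paths, question, and answer are placeholders rather than real dataset entries; the documentation remains the authoritative reference for the exact schema, including the meta file that registers each dataset.

```python
import json

# One illustrative supervised fine-tuning record (placeholder content):
# an image path plus a human/gpt conversation, with "<image>" marking where
# the image is injected into the prompt.
record = {
    "id": 0,
    "image": "images/rsvqa/example_0001.png",
    "conversations": [
        {"from": "human", "value": "<image>\nIs there a residential building in the image?"},
        {"from": "gpt", "value": "Yes"},
    ],
}

# Fine-tuning annotations are stored one JSON object per line (JSONL).
with open("rsvqa_lr_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```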


The results on remote sensing VQA and visual grounding tasks. For the VQA datasets, we omit area and count questions during evaluation. "DA" denotes the model after domain adaptation on remote sensing data.

| Method | Size | RSVQA-LR Rural/Urban | RSVQA-LR Presence | RSVQA-LR Compare | RSVQA-LR Avg. | RSVQA-HR-Test1 Presence | RSVQA-HR-Test1 Compare | RSVQA-HR-Test1 Avg. | RSVQA-HR-Test2 Presence | RSVQA-HR-Test2 Compare | RSVQA-HR-Test2 Avg. | DIOR-RSVG (acc@0.5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RSVQA | - | 90.00 | 87.46 | 81.50 | 86.32 | 90.43 | 88.19 | 83.12 | 86.26 | 85.94 | 77.50 | - |
| Bi-Modal | - | 92.66 | 91.06 | 92.66 | 91.63 | 92.03 | 91.83 | 84.98 | 89.37 | 89.62 | 80.54 | - |
| EasyToHard | - | 91.67 | 90.66 | 87.49 | 89.94 | 91.39 | 89.75 | 93.97 | 87.97 | 87.68 | 79.06 | - |
| GeoChat | 7B | 94.00 | 91.09 | 90.33 | 90.70 | - | - | - | - | 58.45 | 83.19 | 72.30 |
| SkyEyeGPT | - | 75.00 | 88.93 | 88.63 | 84.19 | 84.95 | 85.63 | 85.29 | 83.50 | 80.28 | 81.89 | 88.59 |
| SkySenseGPT | - | 95.00 | 91.07 | 92.00 | 92.69 | - | - | - | 69.14 | 84.14 | 76.64 | - |
| Mini-InternVL-4B | 4B | 66.00 | 64.64 | 73.26 | 69.55 | 62.42 | 79.20 | 71.72 | 65.73 | 79.70 | 73.55 | 16.89 |
| InternVL2-Llama3-76B | 76B | 61.00 | 66.29 | 79.61 | 73.77 | 60.79 | 77.47 | 70.04 | 63.30 | 78.32 | 71.71 | 29.65 |
| Mini-InternVL-DA-1B | 1B | 95.00 | 81.39 | 91.6 | 87.37 | 92.24 | 91.76 | 91.98 | 89.38 | 90.91 | 90.24 | 89.73 |
| Mini-InternVL-DA-2B | 2B | 93.00 | 87.07 | 91.85 | 89.87 | 92.33 | 92.21 | 92.27 | 89.60 | 90.86 | 90.30 | 89.24 |
| Mini-InternVL-DA-4B | 4B | 92.00 | 85.69 | 92.18 | 89.46 | 92.42 | 92.12 | 92.25 | 89.25 | 90.92 | 90.18 | 92.04 |
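The DIOR-RSVG column reports acc@0.5, i.e. the fraction of predicted boxes whose IoU with the ground-truth box is at least 0.5. Below is a minimal sketch of that metric under the standard definition; it is not the repository's evaluation script.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def acc_at_05(predictions, ground_truths):
    """Fraction of predicted boxes with IoU >= 0.5 against their ground truth."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: one hit (IoU >= 0.5) out of two samples -> acc@0.5 = 0.5
print(acc_at_05([(10, 10, 50, 50), (0, 0, 20, 20)],
                [(12, 12, 52, 52), (30, 30, 60, 60)]))
```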

The results of our models on GMAI-MMBench. The results of other models are taken from the GMAI-MMBench leaderboard. "DA" denotes the model after domain adaptation on medical data. After supervised fine-tuning, our models show significant improvements and outperform several medical-specialized models (e.g., LLaVA-Med, RadFM) and some commercial closed-source models (e.g., Claude3-Opus) on most metrics.

| Model | Size | Seg C | Seg M | 2D Cls | 2D Det | 2D Mcls_acc | 2D Mcls_recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-VL-Chat | 9.6B | 34.45 | 35.20 | 39.55 | 22.04 | 42.88 | 81.23 |
| LLaVA-NeXT-mistral-7B | 7.6B | 36.29 | 35.20 | 39.34 | 27.87 | 44.05 | 47.70 |
| LLaVA-Med | - | 18.45 | 18.97 | 21.15 | 17.14 | 45.84 | 41.19 |
| RadFM | 14B | 20.43 | 20.27 | 25.71 | 18.83 | 40.98 | 57.45 |
| Claude3-Opus | - | 33.56 | 33.36 | 32.17 | 24.72 | 45.31 | 38.98 |
| GPT-4V | - | 47.87 | 46.58 | 42.24 | 30.32 | 45.21 | 40.59 |
| Mini-InternVL-1B | 1B | 34.30 | 34.55 | 36.02 | 24.08 | 21.67 | 8.57 |
| Mini-InternVL-2B | 2B | 35.33 | 35.61 | 38.08 | 25.31 | 43.52 | 16.13 |
| Mini-InternVL-4B | 4B | 36.60 | 36.99 | 38.74 | 26.01 | 43.99 | 16.25 |
| Mini-InternVL-DA-1B | 1B | 38.67 | 39.44 | 35.87 | 23.09 | 22.79 | 8.99 |
| Mini-InternVL-DA-2B | 2B | 40.22 | 39.46 | 39.34 | 25.59 | 44.33 | 16.20 |
| Mini-InternVL-DA-4B | 4B | 41.41 | 40.45 | 41.34 | 24.84 | 44.33 | 16.59 |

The results on the action and control-signal prediction tasks of the BDD-X dataset. For the action tasks, we follow DriveGPT4 and report results on action description, action justification, and full-text generation (i.e., combining description and justification). "B4" stands for BLEU-4. RMSE denotes the root mean squared error, and Aτ measures the proportion of test samples with prediction errors less than τ. "DA" denotes the model after domain adaptation on BDD-X.

| Method | Description CIDEr | Description B4 | Description ROUGE | Justification CIDEr | Justification B4 | Justification ROUGE | Full CIDEr | Full B4 | Full ROUGE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ADAPT | 219.35 | 33.42 | 61.83 | 94.62 | 9.95 | 32.01 | 93.66 | 17.76 | 44.32 |
| DriveGPT4 | 254.62 | 35.99 | 63.97 | 101.55 | 10.84 | 31.91 | 102.71 | 19.00 | 45.10 |
| Mini-InternVL-DA-1B | 223.85 | 34.17 | 62.11 | 95.52 | 9.70 | 32.58 | 83.72 | 16.78 | 44.29 |
| Mini-InternVL-DA-2B | 242.14 | 35.77 | 63.03 | 105.06 | 10.63 | 32.46 | 98.47 | 18.05 | 44.52 |
| Mini-InternVL-DA-4B | 237.41 | 35.94 | 63.67 | 104.62 | 9.51 | 32.23 | 97.42 | 17.70 | 44.98 |
| Method | Speed RMSE (m/s) ↓ | Speed A0.1 | Speed A0.5 | Speed A1.0 | Speed A5.0 | Turning Angle RMSE (degree) ↓ | Angle A0.1 | Angle A0.5 | Angle A1.0 | Angle A5.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ADAPT | 3.02 | 9.56 | 24.77 | 37.07 | 90.39 | 11.98 | 27.93 | 66.83 | 75.13 | 89.45 |
| DriveGPT4 | 1.30 | 30.09 | 60.88 | 79.92 | 98.44 | 8.98 | 59.23 | 72.89 | 79.59 | 95.32 |
| Mini-InternVL-DA-1B | 1.28 | 29.44 | 60.38 | 79.34 | 98.67 | 9.45 | 59.34 | 73.54 | 80.28 | 92.76 |
| Mini-InternVL-DA-2B | 1.26 | 27.96 | 59.23 | 80.06 | 98.78 | 9.52 | 57.40 | 72.54 | 80.06 | 92.04 |
| Mini-InternVL-DA-4B | 1.31 | 28.84 | 60.94 | 78.78 | 98.61 | 9.46 | 59.12 | 73.15 | 80.17 | 92.65 |
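For clarity, here is a minimal sketch of the control-signal metrics in the table above, assuming the usual definitions: RMSE over the predicted signal, and Aτ as the percentage of samples whose absolute error is below τ. The official evaluation script may differ in details such as rounding or unit handling.

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error between predicted and ground-truth signals."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def a_tau(pred, target, tau):
    """Percentage of test samples whose absolute prediction error is below tau."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target) < tau) * 100.0)

# Toy example with speed predictions in m/s.
pred = [3.1, 5.4, 0.2, 7.9]
gt = [3.0, 5.0, 1.0, 8.0]
print(rmse(pred, gt))            # ~0.45
print(a_tau(pred, gt, tau=0.5))  # 75.0: three of four errors are below 0.5 m/s
```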

The results on the official driving-with-language leaderboard. "DA" denotes the model after domain adaptation on DriveLM. The other results in the table are taken from the 2024 Autonomous Driving Challenge leaderboard. MTMM†, MMFM_AD, and Team NVIDIA are team names on the challenge leaderboard, which we use to refer to their methods.

| Method | #Param | Accuracy | ChatGPT | Bleu 1 | Bleu 2 | Bleu 3 | Bleu 4 | ROUGE L | CIDEr | Match | Final Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL4Drive-v2 | 26B | 0.7339 | 65.25 | 0.7787 | 0.7176 | 0.6608 | 0.6059 | 0.7449 | 0.2061 | 47.65 | 0.6002 |
| MTMM† | - | 0.7473 | 65.59 | 0.76 | 0.70 | 0.64 | 0.59 | 0.74 | 0.18 | 0.45 | 0.5974 |
| Team NVIDIA | - | 0.7746 | 59.89 | - | - | - | - | - | - | - | 0.5884 |
| MMFM_AD | - | 0.6658 | 63.92 | - | - | - | - | - | - | - | 0.5732 |
| Mini-InternVL-4B | 4B | 0.0 | 54.45 | 0.2405 | 0.0801 | 0.0252 | 0.0084 | 0.1927 | 0.0018 | 34.30 | 0.3051 |
| InternVL2-Llama3-76B | 76B | 0.0 | 52.50 | 0.2100 | 0.0884 | 0.0249 | 0.0078 | 0.1848 | 0.0001 | 34.22 | 0.2963 |
| Mini-InternVL-DA-1B | 1B | 0.7007 | 63.84 | 0.7362 | 0.6767 | 0.6214 | 0.5678 | 0.7365 | 0.1669 | 39.76 | 0.5686 |
| Mini-InternVL-DA-2B | 2B | 0.7628 | 65.23 | 0.7616 | 0.7012 | 0.6452 | 0.5908 | 0.7447 | 0.1914 | 43.24 | 0.5958 |
| Mini-InternVL-DA-4B | 4B | 0.7296 | 63.97 | 0.7642 | 0.7032 | 0.6463 | 0.5914 | 0.7427 | 0.1976 | 42.16 | 0.5821 |

Citation


  @article{gao2024mini,
    title={Mini-InternVL: a flexible-transfer pocket multi-modal model with 5\% parameters and 90\% performance},
    author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
    journal={Visual Intelligence},
    volume={2},
    number={1},
    pages={1--17},
    year={2024},
    publisher={Springer}
  }

  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }

  @inproceedings{chen2024internvl,
    title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={24185--24198},
    year={2024}
  }

