Mini-InternVL 1.5: A Powerful Pocket Multimodal Model with 8% Parameters for 80% Performance
[Go Back] [InternVL 1.0 Paper] [InternVL 1.5 Report] [Chat Demo] [🤗 HF Demo] [ModelScope] [Quick Start] [Chinese Interpretation]
8% of the model size, 80% of the performance; 16% of the model size, 90% of the performance.
Type | Model | Date | Download | Note |
---|---|---|---|---|
Vision Large Language Model | Mini-InternVL-Chat-2B-V1-5 | 2024.05.19 | 🤗 HF link | Only 2B parameters, anyone can deploy it locally. |
Vision Large Language Model | Mini-InternVL-Chat-4B-V1-5 | 2024.05.28 | 🤗 HF link | Only 4B parameters, anyone can deploy it locally. |
Vision Foundation Model | InternViT-300M-448px | 2024.05.25 | 🤗 HF link | Distilled small vision foundation model with 300M parameters. |
You can now run a multimodal large model on a 1080 Ti.
We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have turned their attention to smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we distilled our vision foundation model InternViT-6B-448px-V1-5 down to 300M parameters and paired it with InternLM2-Chat-1.8B or Phi-3-mini-128k-instruct as the language model. The result is a small multimodal model with excellent performance.
As shown in the figure below, we adopted the same model architecture as InternVL 1.5, simply replacing the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B or Phi-3-mini-128k-instruct. We trained this smaller model on the same data as InternVL 1.5. Additionally, because smaller models are cheaper to train, we used a context length of 8K during training.
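To make the ViT-MLP-LLM composition concrete, here is a minimal PyTorch sketch of the wiring described above. It is illustrative only: the class name, dimensions, and the way vision embeddings are fused with the text are placeholder assumptions, not the actual InternVL implementation (which splices vision embeddings into image-placeholder positions of the prompt).

```python
import torch
import torch.nn as nn

class MiniInternVLSketch(nn.Module):
    """Illustrative ViT-MLP-LLM wiring (placeholder, not the official code)."""

    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.vit = vit              # e.g. InternViT-300M-448px backbone
        self.mlp = nn.Sequential(   # lightweight projector (~12.6M params in the 2B model)
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm              # e.g. InternLM2-Chat-1.8B or Phi-3-mini-128k-instruct

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # pixel_values: (num_tiles, 3, 448, 448) image tiles from dynamic resolution
        vision_tokens = self.vit(pixel_values)      # (num_tiles, seq_len, vit_dim)
        vision_embeds = self.mlp(vision_tokens)     # project into the LLM embedding space
        vision_embeds = vision_embeds.flatten(0, 1).unsqueeze(0)
        # Simplified fusion: prepend vision embeddings to the text embeddings.
        fused = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=fused)
```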
From the experimental results, we've observed that our distilled small vision model (InternViT-300M) is well-suited for a smaller language model (1.8B or 3.8B). This combination maximizes efficiency while maintaining impressive performance across various benchmarks, demonstrating the effectiveness of small models in handling complex tasks. Additionally, our small model significantly reduces memory requirements, making it more accessible and efficient for practical use.
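Because the 2B model fits on a consumer GPU, local inference is straightforward with Hugging Face `transformers`. The sketch below is a minimal example assuming the custom `model.chat()` interface exposed by the checkpoint via `trust_remote_code`; the single-tile preprocessing here is a simplification, so refer to the HF model card's `load_image` helper for the full dynamic-tiling pipeline.

```python
import torch
from PIL import Image
from torchvision import transforms as T
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Simplified preprocessing: a single 448x448 tile with ImageNet normalization.
# The model card's load_image helper additionally performs dynamic tiling (up to 12 tiles).
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "Please describe the image in detail."
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```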
Performance
Comparison with SoTA models on 16 multimodal benchmarks. OCR-related benchmarks include: DocVQA test, ChartQA average test, InfographicVQA test, TextVQA val, and OCRBench. General multimodal benchmarks encompass: MME, RealWorldQA, AI2D test, MMMU val, MMBench-EN/CN test, CCBench dev, MMVet, SEED Image, and HallusionBench. Additionally, the math dataset includes MathVista testmini. The MME results we report are the sum of the perception and cognition scores. The results of OCRBench, MMBench, CCBench, and HallusionBench are collected from the OpenCompass leaderboard.
model | open-source | #param | DocVQA (test) | ChartQA (test) | InfoVQA (test) | TextVQA (val) | OCRBench | MME | RWQA | AI2D (test) | MMMU (val) | MMB-EN/CN (test) | CCBench (dev) | MMVet | SEED (image) | HallB | MathVista (mini) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4V | ✗ | - | 88.4 | 78.5 | - | 78.0 | 645 | 1926.6 | 61.4 | 78.2 | 56.8 | 77.0 / 74.4 | 46.5 | 67.6 | 71.6 | 46.5 | 49.9 |
Gemini Pro 1.0 | ✗ | - | 88.1 | 74.1 | 75.2 | 74.6 | 659 | 1933.4 | - | 73.9 | 47.9 | 73.6 / 74.3 | 52.5 | 64.3 | 70.7 | 45.2 | 45.2 |
Gemini Pro 1.5 | ✗ | - | 86.5 | 81.3 | 72.7 | 73.5 | - | - | 67.5 | 80.3 | 58.5 | - / - | - | - | - | - | 52.1 |
Qwen-VL-Plus | ✗ | - | 91.4 | 78.1 | - | - | 694 | 2183.4 | - | 75.9 | 45.2 | 67.0 / 70.7 | 55.1 | 61.1 | 72.7 | 40.6 | 43.3 |
Claude-3 Haiku | ✗ | - | 88.8 | 81.7 | - | - | 658 | 1453.2 | - | 86.7 | 50.2 | 60.7 / 57.2 | 24.5 | - | - | 39.2 | 46.4 |
Step-1V | ✗ | 100B | - | - | - | - | 625 | 2206.4 | - | 79.2 | 49.9 | 80.7 / 79.9 | 71.2 | 63.3 | 70.3 | 48.4 | 44.8 |
Grok-1.5V | ✗ | - | 85.6 | 76.1 | - | 78.1 | - | - | 68.7 | 88.3 | - | - / - | - | - | - | - | 52.8 |
LLaVA-NeXT-34B | ✓ | 35B | 84.0 | 68.7 | 51.5 | 69.5 | 574 | 2028.0 | - | 74.9 | 51.1 | 81.1 / 79.0 | 49.2 | 57.4 | 75.9 | 34.8 | 46.5 |
LLaVA-NeXT-110B | ✓ | 112B | 85.7 (val) | 79.7 | - | - | - | 2200.4 | 63.1 | 80.4 | 49.1 | - / - | - | - | - | - | 49.0 |
InternVL 1.2 | ✓ | 40B | 57.7 | 68.0 | 39.5 | 72.5 | 569 | 2175.4 | 67.5 | 79.0 | 51.6 | 82.2 / 81.2 | 59.2 | 48.9 | 75.6 | 47.6 | 47.7 |
InternVL 1.5 | ✓ | 25.5B | 90.9 | 83.8 | 72.5 | 80.6 | 724 | 2187.8 | 66.0 | 80.7 | 45.2 | 82.2 / 82.0 | 69.8 | 62.8 | 76.0 | 49.3 | 53.5 |
MobileVLM-V2-1.7B | ✓ | 1.7B | - | - | - | 52.1 | - | - | - | - | - | - | - | - | - | - | - |
MobileVLM-V2-3B | ✓ | 3.0B | - | - | - | 57.5 | - | - | - | - | - | - | - | - | - | - | - |
Mini-Gemini-2B | ✓ | 3.5B | 34.2 | - | - | 56.2 | - | 1653.0 | - | - | 31.7 | - / - | - | 31.1 | - | - | 29.4 |
Bunny-v1.0-3B | ✓ | 3.2B | - | - | - | - | - | 1778.1 | - | - | 38.2 | 69.2 / - | - | - | - | - | - |
Bunny-v1.1-4B | ✓ | 4.3B | - | - | - | - | - | 1866.8 | - | - | 40.2 | 74.1 / 66.3 | - | - | 71.7 | - | - |
DeepSeek-VL-1.3B | ✓ | 2.0B | - | - | - | 57.8 | 409 | 1531.6 | 49.7 | 51.5 | 32.2 | 66.4 / 62.9 | 37.6 | 34.8 | 66.7 | 27.6 | 31.1 |
PaliGemma-3B | ✓ | 2.9B | - | - | - | 68.1 | 614 | 1686.1 | 55.2 | 68.3 | 34.9 | 71.0 / 63.6 | 29.6 | 33.1 | 69.6 | 32.2 | 28.7 |
MiniCPM-V | ✓ | 3.4B | 38.2 | - | - | 60.6 | 366 | 1650.2 | 51.2 | 56.3 | 38.3 | 64.1 / 62.6 | 41.4 | 31.1 | 65.6 | 36.2 | 28.9 |
MiniCPM-V-2 | ✓ | 3.4B | 71.9 | - | - | 74.1 | 605 | 1808.6 | 55.8 | 62.9 | 38.2 | 69.1 / 66.5 | 45.3 | 41.0 | 67.1 | 36.1 | 38.7 |
Phi-3-vision-128k-instruct | ✓ | 4.2B | - | 81.4 | - | 70.9 | 639 | 1508.0 | 58.8 | 76.7 | 40.4 | 73.6 / - | 24.1 | - | 70.9 | 39.0 | 44.5 |
Mini-InternVL-2B-1.5 | ✓ | 2.2B | 85.0 | 74.8 | 55.4 | 70.5 | 654 | 1901.5 | 57.9 | 69.8 | 34.6 | 70.9 / 66.2 | 63.5 | 39.3 | 69.8 | 37.5 | 41.1 |
Percent of InternVL-1.5 | | 8.6% | 93.5% | 89.3% | 76.4% | 87.5% | 90.3% | 86.9% | 87.7% | 86.5% | 76.5% | 83.5% | 91.0% | 62.6% | 91.8% | 76.1% | 76.8% |
Mini-InternVL-4B-1.5 | ✓ | 4.2B | 87.7 | 81.0 | 64.6 | 72.5 | 638 | 2053.6 | 60.1 | 76.9 | 43.3 | 76.2 / 70.3 | 58.8 | 46.7 | 72.5 | 42.8 | 53.7 |
Percent of InternVL-1.5 | | 16.5% | 96.5% | 96.7% | 89.1% | 90.0% | 88.1% | 93.9% | 91.1% | 95.3% | 95.8% | 89.2% | 84.2% | 74.4% | 95.4% | 86.8% | 100.0% |
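For reference, the "Percent of InternVL-1.5" rows appear to be plain ratios against the InternVL 1.5 row above (with the MMB-EN/CN pair averaged). A quick sanity check of a few 2B entries:

```python
# Spot-check a few "Percent of InternVL-1.5" entries for the 2B model,
# assuming each percentage is the plain ratio Mini-InternVL / InternVL 1.5.
pairs = {
    "#param":    (2.2, 25.5),        # -> ~8.6%
    "DocVQA":    (85.0, 90.9),       # -> ~93.5%
    "MME":       (1901.5, 2187.8),   # -> ~86.9%
    "MathVista": (41.1, 53.5),       # -> ~76.8%
}
for name, (mini, full) in pairs.items():
    print(f"{name}: {100 * mini / full:.1f}%")
```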
Model Card
Name | Mini-InternVL-Chat-2B-V1-5 | Mini-InternVL-Chat-4B-V1-5 |
---|---|---|
Model Size (Total) | 2.21B | 4.15B |
Model Size (ViT) | 304.01M | 304.01M |
Model Size (MLP) | 12.60M | 22.03M |
Model Size (LLM) | 1.89B | 3.82B |
Resolution | dynamic resolution, up to 12 tiles of 448 × 448 in training and up to 40 tiles in testing (4K resolution); see the tiling sketch below this table. | |
Stage-1 Training Data | The pre-training dataset utilized in our InternVL 1.5 encompasses a diverse range of publicly accessible sources. These datasets span multiple tasks, including captioning, which predominantly uses datasets such as Laion-EN, Laion-ZH, COYO, and GRIT, constituting 53.9% of the total data. Detection and grounding tasks utilize datasets like Objects365, GRIT, and All-Seeing, making up 5.2%. For OCR tasks, we utilized large-scale datasets such as Wukong-OCR, LaionCOCO-OCR, and Common Crawl PDFs, which constitute 32.0% of our data. These datasets were constructed using PaddleOCR to perform OCR on Chinese images from Wukong and on English images from LaionCOCO. Smaller OCR datasets include MMC-Inst, LSVT, ST-VQA, RCTW-17, ArT, and others, accounting for 8.9% of the data, which focus on more specific or constrained OCR challenges. | |
Stage-1 Trainable Module | ViT + MLP | MLP |
Stage-2 Training Data | 5M high-quality bilingual data. Please see the technical report of InternVL 1.5 for more details. | |
Stage-2 Trainable Module | ViT + MLP + LLM | ViT + MLP + LLM |
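The dynamic-resolution input splits each image into 448 × 448 tiles arranged in a grid chosen to match the image's aspect ratio, up to 12 tiles in training and 40 tiles in testing. Below is a simplified, hypothetical sketch of such a tiling step; the actual preprocessing ships with the model repository and, among other details, typically also appends a downscaled thumbnail tile.

```python
from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 448, max_tiles: int = 12):
    """Simplified dynamic tiling sketch (not the official preprocessing).

    Picks a cols x rows grid whose tile count stays within max_tiles and whose
    aspect ratio is closest to the input image, then resizes and slices.
    """
    w, h = image.size
    aspect = w / h
    # Enumerate all grids with at most max_tiles tiles and pick the closest aspect ratio.
    grids = [(c, r) for c in range(1, max_tiles + 1)
                    for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
            tiles.append(resized.crop(box))
    return tiles  # each tile is a 448x448 crop fed to InternViT-300M
```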
The hyperparameters used for pre-training and fine-tuning are listed in the following table.
Size | Stage | Trainable Module | #Sample | Drop Path | Batch Size | LR | Epoch | Max Length | Weight Decay | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
2B | Pretrain | ViT + MLP | ~200M | 0.1 | 2048 | 2e-5 | 1 | 4096 | 0.01 | Link | ViT / MLP |
2B | Finetune | ViT + MLP + LLM | ~5M | 0.1 | 1024 | 4e-5 | 1 | 8192 | 0.01 | Link | MLLM |
4B | Pretrain | MLP | ~24M | 0.0 | 2048 | 2e-4 | 1 | 4096 | 0.05 | Link | MLP |
4B | Finetune | ViT + MLP + LLM | ~5M | 0.1 | 1024 | 4e-5 | 1 | 8192 | 0.05 | Link | MLLM |
Citation
@article{chen2023internvl,
title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2312.14238},
year={2023}
}
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}