
You can now run large multimodal models on a single 1080 Ti.

We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have turned their attention to smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by these efforts, we distilled our vision foundation model InternViT-6B-448px-V1-5 down to 300M parameters and paired it with InternLM2-Chat-1.8B or Phi-3-mini-128k-instruct as the language model, resulting in a small multimodal model with excellent performance.

As shown in the figure below, we adopted the same model architecture as InternVL 1.5, simply replacing the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B or Phi-3-mini-128k-instruct. We trained this smaller model on the same data as InternVL 1.5. Additionally, since smaller models are cheaper to train, we extended the context length to 8K during training.

[Figure: Overall architecture of Mini-InternVL-Chat, following the InternVL 1.5 design with InternViT-300M as the vision encoder and InternLM2-Chat-1.8B or Phi-3-mini-128k-instruct as the language model.]

The experimental results show that our distilled vision model (InternViT-300M) pairs well with smaller language models (1.8B or 3.8B). This combination maximizes efficiency while maintaining strong performance across a wide range of benchmarks, demonstrating that small models can handle complex multimodal tasks. In addition, the small model substantially reduces memory requirements, making it more accessible and efficient for practical use.
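
To make the memory claim concrete, the sketch below loads the 2B model in half precision with Hugging Face Transformers and runs a single-image chat turn. It is a minimal sketch, assuming the released OpenGVLab/Mini-InternVL-Chat-2B-V1-5 checkpoint and the chat() helper it ships via trust_remote_code; the single-tile preprocessing is a simplification of the dynamic-resolution pipeline described in the Model Card, and example.jpg is a placeholder path.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"  # 4B variant: OpenGVLab/Mini-InternVL-Chat-4B-V1-5
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.float16,   # half precision keeps the 2B model within consumer-GPU memory
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Minimal single-tile preprocessing: one 448x448 tile with ImageNet normalization.
# The released pipeline instead tiles the image dynamically (up to 12/40 tiles).
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")  # placeholder input image
pixel_values = preprocess(image).unsqueeze(0).to(torch.float16).cuda()

# Single-turn, single-image query through the checkpoint's chat() helper.
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, "Please describe the image in detail.", generation_config)
print(response)
```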

Performance

Comparison with SoTA models on 16 multimodal benchmarks. OCR-related benchmarks include DocVQA test, ChartQA test (average), InfographicVQA test, TextVQA val, and OCRBench. General multimodal benchmarks include MME, RealWorldQA, AI2D test, MMMU val, MMBench-EN/CN test, CCBench dev, MMVet, SEED Image, and HallusionBench. The math benchmark is MathVista testmini. The MME results we report are the sum of the perception and cognition scores. The results for OCRBench, MMBench, CCBench, and HallusionBench are collected from the OpenCompass leaderboard.

| Model | Open-Source | #Param | DocVQA (test) | ChartQA (test) | InfoVQA (test) | TextVQA (val) | OCRBench | MME | RWQA | AI2D (test) | MMMU (val) | MMB-EN/CN (test) | CCBench (dev) | MMVet | SEED (image) | HallB | MathVista (mini) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | ✗ | - | 88.4 | 78.5 | - | 78.0 | 645 | 1926.6 | 61.4 | 78.2 | 56.8 | 77.0 / 74.4 | 46.5 | 67.6 | 71.6 | 46.5 | 49.9 |
| Gemini Pro 1.0 | ✗ | - | 88.1 | 74.1 | 75.2 | 74.6 | 659 | 1933.4 | - | 73.9 | 47.9 | 73.6 / 74.3 | 52.5 | 64.3 | 70.7 | 45.2 | 45.2 |
| Gemini Pro 1.5 | ✗ | - | 86.5 | 81.3 | 72.7 | 73.5 | - | - | 67.5 | 80.3 | 58.5 | - / - | - | - | - | - | 52.1 |
| Qwen-VL-Plus | ✗ | - | 91.4 | 78.1 | - | - | 694 | 2183.4 | - | 75.9 | 45.2 | 67.0 / 70.7 | 55.1 | 61.1 | 72.7 | 40.6 | 43.3 |
| Claude-3 Haiku | ✗ | - | 88.8 | 81.7 | - | - | 658 | 1453.2 | - | 86.7 | 50.2 | 60.7 / 57.2 | 24.5 | - | - | 39.2 | 46.4 |
| Step-1V | ✗ | 100B | - | - | - | - | 625 | 2206.4 | - | 79.2 | 49.9 | 80.7 / 79.9 | 71.2 | 63.3 | 70.3 | 48.4 | 44.8 |
| Grok-1.5V | ✗ | - | 85.6 | 76.1 | - | 78.1 | - | - | 68.7 | 88.3 | - | - / - | - | - | - | - | 52.8 |
| LLaVA-NeXT-34B | ✓ | 35B | 84.0 | 68.7 | 51.5 | 69.5 | 574 | 2028.0 | - | 74.9 | 51.1 | 81.1 / 79.0 | 49.2 | 57.4 | 75.9 | 34.8 | 46.5 |
| LLaVA-NeXT-110B | ✓ | 112B | 85.7 (val) | 79.7 | - | - | - | 2200.4 | 63.1 | 80.4 | 49.1 | - / - | - | - | - | - | 49.0 |
| InternVL 1.2 | ✓ | 40B | 57.7 | 68.0 | 39.5 | 72.5 | 569 | 2175.4 | 67.5 | 79.0 | 51.6 | 82.2 / 81.2 | 59.2 | 48.9 | 75.6 | 47.6 | 47.7 |
| InternVL 1.5 | ✓ | 25.5B | 90.9 | 83.8 | 72.5 | 80.6 | 724 | 2187.8 | 66.0 | 80.7 | 45.2 | 82.2 / 82.0 | 69.8 | 62.8 | 76.0 | 49.3 | 53.5 |
| MobileVLM-V2-1.7B | ✓ | 1.7B | - | - | - | 52.1 | - | - | - | - | - | - | - | - | - | - | - |
| MobileVLM-V2-3B | ✓ | 3.0B | - | - | - | 57.5 | - | - | - | - | - | - | - | - | - | - | - |
| Mini-Gemini-2B | ✓ | 3.5B | 34.2 | - | - | 56.2 | - | 1653.0 | - | - | 31.7 | - / - | - | 31.1 | - | - | 29.4 |
| Bunny-v1.0-3B | ✓ | 3.2B | - | - | - | - | - | 1778.1 | - | - | 38.2 | 69.2 / - | - | - | - | - | - |
| Bunny-v1.1-4B | ✓ | 4.3B | - | - | - | - | - | 1866.8 | - | - | 40.2 | 74.1 / 66.3 | - | - | 71.7 | - | - |
| DeepSeek-VL-1.3B | ✓ | 2.0B | - | - | - | 57.8 | 409 | 1531.6 | 49.7 | 51.5 | 32.2 | 66.4 / 62.9 | 37.6 | 34.8 | 66.7 | 27.6 | 31.1 |
| PaliGemma-3B | ✓ | 2.9B | - | - | - | 68.1 | 614 | 1686.1 | 55.2 | 68.3 | 34.9 | 71.0 / 63.6 | 29.6 | 33.1 | 69.6 | 32.2 | 28.7 |
| MiniCPM-V | ✓ | 3.4B | 38.2 | - | - | 60.6 | 366 | 1650.2 | 51.2 | 56.3 | 38.3 | 64.1 / 62.6 | 41.4 | 31.1 | 65.6 | 36.2 | 28.9 |
| MiniCPM-V-2 | ✓ | 3.4B | 71.9 | - | - | 74.1 | 605 | 1808.6 | 55.8 | 62.9 | 38.2 | 69.1 / 66.5 | 45.3 | 41.0 | 67.1 | 36.1 | 38.7 |
| Phi-3-vision-128k-instruct | ✓ | 4.2B | - | 81.4 | - | 70.9 | 639 | 1508.0 | 58.8 | 76.7 | 40.4 | 73.6 / - | 24.1 | - | 70.9 | 39.0 | 44.5 |
| Mini-InternVL-2B-1.5 | ✓ | 2.2B | 85.0 | 74.8 | 55.4 | 70.5 | 654 | 1901.5 | 57.9 | 69.8 | 34.6 | 70.9 / 66.2 | 63.5 | 39.3 | 69.8 | 37.5 | 41.1 |
| Percent of InternVL-1.5 | | 8.6% | 93.5% | 89.3% | 76.4% | 87.5% | 90.3% | 86.9% | 87.7% | 86.5% | 76.5% | 83.5% | 91.0% | 62.6% | 91.8% | 76.1% | 76.8% |
| Mini-InternVL-4B-1.5 | ✓ | 4.2B | 87.7 | 81.0 | 64.6 | 72.5 | 638 | 2053.6 | 60.1 | 76.9 | 43.3 | 76.2 / 70.3 | 58.8 | 46.7 | 72.5 | 42.8 | 53.7 |
| Percent of InternVL-1.5 | | 16.5% | 96.5% | 96.7% | 89.1% | 90.0% | 88.1% | 93.9% | 91.1% | 95.3% | 95.8% | 89.2% | 84.2% | 74.4% | 95.4% | 86.8% | 100.0% |
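
The "Percent of InternVL-1.5" rows above are simply the ratio of each Mini-InternVL score to the corresponding InternVL 1.5 score. A small sketch of that calculation, using values copied from the table (the benchmark selection here is illustrative):

```python
# Ratio of Mini-InternVL-2B-1.5 to InternVL 1.5 on a few benchmarks from the table above.
internvl_1_5 = {"DocVQA": 90.9, "MME": 2187.8, "MathVista": 53.5}
mini_2b = {"DocVQA": 85.0, "MME": 1901.5, "MathVista": 41.1}

for name, full_score in internvl_1_5.items():
    print(f"{name}: {mini_2b[name] / full_score:.1%}")
# DocVQA: 93.5%, MME: 86.9%, MathVista: 76.8%
```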

Model Card

| Name | Mini-InternVL-Chat-2B-V1-5 | Mini-InternVL-Chat-4B-V1-5 |
|---|---|---|
| Model Size (Total) | 2.21B | 4.15B |
| ViT | 304.01M | 304.01M |
| MLP | 12.60M | 22.03M |
| LLM | 1.89B | 3.82B |
| Resolution | Dynamic resolution: up to 12 tiles of 448 × 448 in training, up to 40 tiles in testing (4K resolution) | Same as 2B |
| Stage-1 Training Data | Same pre-training data as InternVL 1.5 (see below) | Same as 2B |
| Stage-1 Trainable Module | ViT + MLP | MLP |
| Stage-2 Training Data | 5M high-quality bilingual samples; see the InternVL 1.5 technical report for details | Same as 2B |
| Stage-2 Trainable Module | ViT + MLP + LLM | ViT + MLP + LLM |

Stage-1 training data: the pre-training dataset used in InternVL 1.5 encompasses a diverse range of publicly accessible sources spanning multiple tasks. Captioning, which predominantly uses Laion-EN, Laion-ZH, COYO, and GRIT, constitutes 53.9% of the total data. Detection and grounding tasks use datasets such as Objects365, GRIT, and All-Seeing, making up 5.2%. For OCR, we used large-scale datasets such as Wukong-OCR, LaionCOCO-OCR, and Common Crawl PDFs, which constitute 32.0% of the data; these were constructed by running PaddleOCR on Chinese images from Wukong and English images from LaionCOCO. Smaller OCR datasets, including MMC-Inst, LSVT, ST-VQA, RCTW-17, ArT, and others, account for 8.9% of the data and focus on more specific or constrained OCR challenges.
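
The "dynamic resolution" entry refers to the tiling scheme inherited from InternVL 1.5: an input image is matched to a grid of 448 × 448 tiles, with at most 12 tiles during training and up to 40 tiles (roughly 4K input) at test time. The sketch below illustrates the idea under simplified assumptions; it is not the repository's exact preprocessing implementation (grid selection, minimum tile count, and the extra thumbnail tile are simplified), and the function name is hypothetical.

```python
from PIL import Image

TILE = 448  # tile side used by the 448px vision encoder

def dynamic_tiles(image: Image.Image, max_tiles: int = 12):
    """Simplified sketch of dynamic-resolution tiling: pick the tile grid
    (cols x rows <= max_tiles) whose aspect ratio is closest to the image's,
    resize the image to that grid, and slice it into 448x448 tiles."""
    w, h = image.size
    aspect = w / h
    # Enumerate all grids with at most max_tiles tiles.
    grids = [(c, r) for c in range(1, max_tiles + 1)
                    for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    return tiles  # the released pipeline also appends a global thumbnail tile

# Example: at test time a 4K frame would be split into up to 40 tiles.
# tiles = dynamic_tiles(Image.open("example.jpg").convert("RGB"), max_tiles=40)
```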

The hyperparameters used for pre-training and fine-tuning are listed in the following table.

| Size | Stage | Trainable Module | #Sample | Drop Path | Batch Size | LR | Epoch | Max Length | Weight Decay | Config | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2B | Pretrain | ViT + MLP | ~200M | 0.1 | 2048 | 2e-5 | 1 | 4096 | 0.01 | Link | ViT / MLP |
| 2B | Finetune | ViT + MLP + LLM | ~5M | 0.1 | 1024 | 4e-5 | 1 | 8192 | 0.01 | Link | MLLM |
| 4B | Pretrain | MLP | ~24M | 0.0 | 2048 | 2e-4 | 1 | 4096 | 0.05 | Link | MLP |
| 4B | Finetune | ViT + MLP + LLM | ~5M | 0.1 | 1024 | 4e-5 | 1 | 8192 | 0.05 | Link | MLLM |

Citation


  @article{chen2023internvl,
    title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
    journal={arXiv preprint arXiv:2312.14238},
    year={2023}
  }
  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }
  
