We introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. It features three simple designs:

  1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred to and reused with different LLMs.
  2. Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448 × 448 pixels, according to the aspect ratio and resolution of the input images, supporting inputs up to 4K resolution (see the tiling sketch after this list).
  3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance on OCR- and Chinese-related tasks.
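
To make the dynamic high-resolution design above concrete, here is a minimal sketch of an aspect-ratio-matching tiling step: choose the tile grid whose aspect ratio is closest to the input image, subject to the tile budget, then resize and crop. The helper names and the exact grid-selection heuristic are illustrative assumptions rather than the released preprocessing code.

```python
from PIL import Image

def choose_grid(width: int, height: int, min_tiles: int = 1, max_tiles: int = 40):
    """Pick a (cols, rows) tile grid whose aspect ratio is closest to the image's,
    with cols * rows capped at max_tiles (12 during training, 40 at test time)."""
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if min_tiles <= c * r <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with more tiles
    # (i.e., higher effective resolution).
    return min(candidates, key=lambda cr: (abs(cr[0] / cr[1] - aspect), -cr[0] * cr[1]))

def tile_image(img: Image.Image, tile: int = 448, max_tiles: int = 40):
    cols, rows = choose_grid(*img.size, max_tiles=max_tiles)
    resized = img.resize((cols * tile, rows * tile))
    return [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows)
        for c in range(cols)
    ]  # each crop is a 448 x 448 tile fed to the vision encoder

With a budget of 40 tiles, an 8 × 5 grid covers 3584 × 2240 pixels, which is roughly 4K-level input; during training the cap is 12 tiles.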

Method

As illustrated in Figure 3, InternVL 1.5 employs an architecture akin to widely used open-source MLLMs, specifically the “ViT-MLP-LLM” configuration referenced in various existing studies. Our implementation integrates a pre-trained InternViT-6B with a pre-trained InternLM2-20B through a randomly initialized MLP projector. During training, we apply a dynamic resolution strategy that divides images into 1 to 12 tiles of 448 × 448 pixels, based on the aspect ratio and resolution of the input images. During testing, this can be zero-shot scaled up to 40 tiles (i.e., 4K resolution). To enhance scalability to high resolution, we simply employ a pixel shuffle operation that reduces the number of visual tokens to one-quarter of the original. As a result, a 448 × 448 image is represented by 256 visual tokens in our model.

Figure 3: Overall architecture of InternVL 1.5.
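
To illustrate the token arithmetic in the paragraph above: with 14 × 14 patches, a 448 × 448 tile yields a 32 × 32 grid of 1,024 patch tokens, and a pixel-shuffle (space-to-depth) step merges each 2 × 2 group into one wider token, leaving 256 visual tokens per tile for the MLP projector. The snippet below is a shape-level sketch under these assumptions (including the hidden size), not the exact released implementation.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge each (scale x scale) neighborhood of patch tokens into one token.

    x: (batch, h * w, channels) patch tokens laid out on a square h x w grid.
    Returns: (batch, (h // scale) * (w // scale), channels * scale ** 2).
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    assert h * w == n and h % scale == 0, "expects a square grid divisible by scale"
    # Space-to-depth: split the grid into scale x scale blocks, then fold each
    # block into the channel dimension.
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # (b, h//s, w//s, s, s, c)
    return x.view(b, (h // scale) * (w // scale), c * scale * scale)

# One 448 x 448 tile -> (448 / 14)^2 = 1024 patch tokens; hidden size 3200 is an assumption.
patch_tokens = torch.randn(1, 1024, 3200)
visual_tokens = pixel_shuffle_tokens(patch_tokens)
print(visual_tokens.shape)  # torch.Size([1, 256, 12800]) -> 256 visual tokens per tile
```

In this sketch, the merged, wider tokens are what the randomly initialized MLP projector would map into the LLM embedding space.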

Model Card

| Name | InternVL-Chat-V1-5 | InternVL-Chat-V1-5-Plus |
|---|---|---|
| Total Parameters | 25.51B | 40.07B |
| ViT Parameters | 5.54B | 5.54B |
| MLP Parameters | 116.43M | 143.17M |
| LLM Parameters | 19.86B | 34.39B |
| Stage-1 Trainable Module | ViT + MLP | MLP |
| Stage-2 Trainable Module | ViT + MLP + LLM | ViT + MLP + LLM |

Resolution (both models): dynamic, up to 12 tiles of 448 × 448 during training and up to 40 tiles during testing (4K resolution).

Stage-1 Training Data (both models): The pre-training dataset used for InternVL 1.5 encompasses a diverse range of publicly accessible sources spanning multiple tasks. Captioning predominantly uses datasets such as Laion-EN, Laion-ZH, COYO, and GRIT, constituting 53.9% of the total data. Detection and grounding tasks utilize datasets like Objects365, GRIT, and All-Seeing, making up 5.2%. For OCR tasks, we utilized large-scale datasets such as Wukong-OCR, LaionCOCO-OCR, and Common Crawl PDFs, which constitute 32.0% of our data; these were constructed by running PaddleOCR on Chinese images from Wukong and on English images from LaionCOCO. Smaller OCR datasets, including MMC-Inst, LSVT, ST-VQA, RCTW-17, ArT, and others, account for 8.9% of the data and focus on more specific or constrained OCR challenges.

Stage-2 Training Data (both models): 5M high-quality bilingual samples. Please see our technical report for more details.

The hyperparameters used for pre-training and fine-tuning are listed in the following table.

| Size | Stage | Trainable Module | #Samples | Drop Path | Batch Size | LR | Epoch | Max Length | Weight Decay | Config | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 26B | Pretrain | ViT + MLP | ~200M | 0.2 | 2048 | 1e-5 | 1 | 4096 | 0.05 | Link | ViT / MLP |
| 26B | Finetune | ViT + MLP + LLM | ~5M | 0.4 | 1024 | 2e-5 | 1 | 4096 | 0.05 | Link | MLLM |
| 40B | Pretrain | MLP | ~3M | 0.0 | 2048 | 1e-4 | 1 | 4096 | 0.05 | Link | MLP |
| 40B | Finetune | ViT + MLP + LLM | ~5M | 0.4 | 1024 | 2e-5 | 1 | 4096 | 0.05 | Link | - |

Performance

Examples

Citation


  @article{chen2023internvl,
    title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
    journal={arXiv preprint arXiv:2312.14238},
    year={2023}
  }
  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }
