Mono-InternVL

Abstract

In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. In particular, we identify that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting. To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, thereby stably learning visual knowledge from noisy data while freezing the LLM. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results confirm the superior performance of Mono-InternVL than existing monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3 on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5, Mono-InternVL still retains comparable multimodal performance while reducing up to 67% first token latency.

The Monolithic Architecture

Mono-InternVL consists of tokenizers and a multimodal mixture-of-experts structure.

(1) Visual and textual embeddings. Compared to modular MLLMs, Mono-InternVL directly patchifies images to input visual sequences using a lightweight module.
(2) Multimodal mixture-of-experts structure. The key principle of Mono-InternVL is to embed visual experts into a pre-trained LLM, thereby facilitating visual pre-training using the pre-trained LLM’s knowledge and significantly mitigating the issue of catastrophic forgetting.

Endogenous Visual Pre-training

Based on the monolithic architecture, we present an innovative visual pre-training method called Endogenous Visual Pre-training (EViP). Specifically, EViP is formulated as a progressive learning process of three stages: 1) concept learning to grasp basic visual concepts, 2) semantic learning to capture high-level semantics, e.g., world knowledge, and 3) alignment learning to align knowledge with downstream tasks. Benefiting from the architecture and the pre-training strategy, the visual scalability of Mono-InternVL is fully unleashed, where the downstream performance consistently improves as the scale of the pre-training data increases. After visual pre-training, Mono-InternVL accommodates complex multimodal tasks via supervised instruction tuning.

Performance

Examples

Citation


  @article{luo2024mono,
    title={Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training},
    author={Luo, Gen and Yang, Xue and Dou, Wenhan and Wang, Zhaokai and Liu, Jiawen and Dai, Jifeng and Qiao, Yu and Zhu, Xizhou},
    journal={arXiv preprint arXiv:2410.08202},
    year={2024}
  }

  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }

  @inproceedings{chen2024internvl,
    title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={24185--24198},
    year={2024}
  }

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

CVPR 2025