InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
[Go Back] [InternVL3.5 Report] [InternVL3 Report] [InternVL 2.5 Report] [InternVL 1.0 Paper] [InternVL 1.5 Paper] [Chat Demo] [GitHub] [Documents] [HF Demo] [ModelScope] [Quick Start]
| Type | Model | Date | HF Link | MS Link | Document |
|---|---|---|---|---|---|
| Multimodal Large Language Models (GitHub Format) | InternVL3_5-1B | 2025.08.26 | link | link | doc |
| | InternVL3_5-2B | 2025.08.26 | link | link | doc |
| | InternVL3_5-4B | 2025.08.26 | link | link | doc |
| | InternVL3_5-8B | 2025.08.26 | link | link | doc |
| | InternVL3_5-14B | 2025.08.26 | link | link | doc |
| | InternVL3_5-38B | 2025.08.26 | link | link | doc |
| | InternVL3_5-20B-A4B | 2025.08.26 | link | link | doc |
| | InternVL3_5-30B-A3B | 2025.08.26 | link | link | doc |
| | InternVL3_5-241B-A28B | 2025.08.26 | link | link | doc |
| Multimodal Large Language Models (HuggingFace Format) | InternVL3_5-1B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-2B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-4B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-8B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-14B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-38B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-20B-A4B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-30B-A3B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-241B-A28B-HF | 2025.08.26 | link | link | doc |
We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models such as GPT-5. All models and code are publicly released.
Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and two-stage Cascade Reinforcement Learning (Cascade RL). In Cascade RL, we first fine-tune the model with Mixed Preference Optimization (MPO) in an offline RL setting, followed by GSPO in an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the number of tokens required to represent an image patch.
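To make the cascade concrete, below is a minimal PyTorch sketch of how the two RL stages are sequenced, assuming a DPO-style preference term as a stand-in for the offline MPO objective and a clipped policy-gradient surrogate as a stand-in for online GSPO; all function names, weights, and toy tensors are illustrative and are not taken from the released training code.

```python
import torch
import torch.nn.functional as F

def offline_preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO-style preference term, used here as a stand-in for the offline MPO stage
    # (the full MPO objective also mixes quality and generation terms).
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

def online_clipped_loss(log_ratio, advantages, clip_eps=0.2):
    # Clipped policy-gradient surrogate, standing in for the online GSPO stage
    # (GSPO defines the importance ratio at the sequence level).
    ratio = log_ratio.exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy tensors just to show the two stages running in order (offline -> online).
chosen, rejected = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
stage1 = offline_preference_loss(chosen, rejected, ref_c, ref_r)

log_ratio, advantages = torch.randn(8), torch.randn(8)
stage2 = online_clipped_loss(log_ratio, advantages)
print(f"offline (MPO-style) loss {stage1.item():.3f} -> online (GSPO-style) loss {stage2.item():.3f}")
```

The point of the coarse-to-fine ordering is that the offline stage gives a stable warm-up from pre-collected preference pairs, so the online stage can spend its rollout budget on finer-grained alignment.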
We also open-source the model weights from different training stages for research use. If you are unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.
Model | Training Pipeline | HF Link | MS Link |
---|---|---|---|
InternVL3_5-1B-Pretrained | CPT | link | link |
InternVL3_5-1B-Instruct | CPT + SFT | link | link |
InternVL3_5-1B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-1B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-2B-Pretrained | CPT | link | link |
InternVL3_5-2B-Instruct | CPT + SFT | link | link |
InternVL3_5-2B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-2B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-4B-Pretrained | CPT | link | link |
InternVL3_5-4B-Instruct | CPT + SFT | link | link |
InternVL3_5-4B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-4B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-8B-Pretrained | CPT | link | link |
InternVL3_5-8B-Instruct | CPT + SFT | link | link |
InternVL3_5-8B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-8B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-14B-Pretrained | CPT | link | link |
InternVL3_5-14B-Instruct | CPT + SFT | link | link |
InternVL3_5-14B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-14B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-30B-A3B-Pretrained | CPT | link | link |
InternVL3_5-30B-A3B-Instruct | CPT + SFT | link | link |
InternVL3_5-30B-A3B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-30B-A3B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-38B-Pretrained | CPT | link | link |
InternVL3_5-38B-Instruct | CPT + SFT | link | link |
InternVL3_5-38B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-38B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-241B-A28B-Pretrained | CPT | link | link |
InternVL3_5-241B-A28B-Instruct | CPT + SFT | link | link |
InternVL3_5-241B-A28B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-241B-A28B | CPT + SFT + Cascade RL | link | link |
The Flash version of our model will be released as soon as possible.
The InternVL3.5 family is built upon the following designs:
- Dynamic High-Resolution Architecture: We follow the ViT-MLP-LLM paradigm of earlier InternVL models. The language model is initialized from Qwen3 or GPT-OSS (depending on the variant), while the vision encoder adopts InternViT-300M or InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL 1.5 is retained to improve image understanding at varying resolutions (a simplified tiling sketch is given after this list).
- Visual Resolution Router: InternVL3.5-Flash incorporates the Visual Resolution Router (ViR) to enable adaptive compression of visual tokens. Each image patch is routed to an appropriate compression rate according to its semantic richness. This patch-aware routing reduces the number of visual tokens by up to 50% while maintaining nearly full performance, making the Flash variants highly efficient in resource-constrained scenarios (a router sketch is given after this list).
- Progressive Training with Cascade RL: Training integrates pre-training, supervised fine-tuning, and a cascade reinforcement learning paradigm. Offline RL warm-up with Mixed Preference Optimization (MPO) is followed by online RL with GSPO, striking a balance between training efficiency and quality of reasoning.
- Visual Consistency Learning (ViCO): We introduce ViCO to align responses across different compression rates of visual tokens, enabling efficient variants such as InternVL3.5-Flash. A two-stage process, consistency training with a KL-divergence objective followed by router training, ensures robustness under varying token resolutions (a consistency-loss sketch is given after this list).
- Enhanced Test-Time Scaling: InternVL3.5 adopts a comprehensive test-time scaling strategy that combines deep thinking (step-by-step reasoning in Thinking mode) and parallel thinking (Best-of-N selection with VisualPRM-v1.1 as the critic). This improves both the depth and breadth of reasoning (a Best-of-N sketch is given after this list).
- Decoupled Vision-Language Deployment: To reduce inference cost and avoid blocking between vision and language computation, InternVL3.5 introduces Decoupled Vision-Language Deployment (DvD). The vision modules (ViT, MLP, and ViR) run on dedicated vision servers, while the language model runs on separate language servers. This design enables asynchronous pipelining, higher throughput, and more flexible system optimization (a pipelining sketch is given after this list).
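The Dynamic High Resolution strategy mentioned above boils down to cutting each image into a grid of 448 × 448 tiles whose shape best matches the input's aspect ratio. Below is a simplified sketch of that idea, assuming Pillow for image handling; the actual preprocessing in the InternVL codebase uses additional tie-breaking rules, so treat this as an approximation rather than the shipped implementation.

```python
from PIL import Image

def pick_tile_grid(width, height, min_tiles=1, max_tiles=36):
    """Choose the (cols, rows) grid whose aspect ratio best matches the image."""
    aspect = width / height
    grids = [(c, r) for r in range(1, max_tiles + 1)
             for c in range(1, max_tiles + 1)
             if min_tiles <= c * r <= max_tiles]
    return min(grids, key=lambda g: abs(g[0] / g[1] - aspect))

def tile_image(img, tile=448, max_tiles=36, use_thumbnail=True):
    """Resize the image onto the chosen grid and cut it into 448x448 tiles."""
    cols, rows = pick_tile_grid(img.width, img.height, max_tiles=max_tiles)
    resized = img.resize((cols * tile, rows * tile))
    tiles = [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]
    if use_thumbnail and len(tiles) > 1:
        tiles.append(img.resize((tile, tile)))  # extra global-overview tile
    return tiles

# Example: a 16:9 input is covered by a wide grid plus one thumbnail tile.
print(len(tile_image(Image.new("RGB", (1920, 1080)))), "tiles")
```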
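The patch-aware routing of ViR can be pictured as a tiny classifier that scores each tile and merges the tokens of tiles judged compressible. The sketch below is a hypothetical illustration: the router architecture, the pooling-based merge, and the 4× factor are assumptions made for exposition, not the released ViR implementation.

```python
import torch
import torch.nn as nn

class TileRouter(nn.Module):
    """Toy router: decide per tile whether its visual tokens can be compressed."""
    def __init__(self, hidden=64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden, hidden // 2),
            nn.GELU(),
            nn.Linear(hidden // 2, 2),  # class 0 = keep, class 1 = compress
        )

    def forward(self, tile_tokens):              # (num_tiles, tokens, hidden)
        pooled = tile_tokens.mean(dim=1)         # one descriptor per tile
        return self.scorer(pooled).argmax(-1)    # routing decision per tile

def merge_tokens(tile_tokens, decisions, factor=4):
    """Stand-in compression: average groups of `factor` tokens on compressible tiles."""
    out = []
    for tokens, keep_full in zip(tile_tokens, decisions == 0):
        if keep_full:
            out.append(tokens)
        else:
            t, h = tokens.shape
            out.append(tokens.view(t // factor, factor, h).mean(dim=1))
    return out

tiles = torch.randn(6, 256, 64)                  # 6 tiles, 256 visual tokens each
decisions = TileRouter()(tiles)
compressed = merge_tokens(tiles, decisions)
print([t.shape[0] for t in compressed])          # mixture of 256- and 64-token tiles
```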
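The consistency-training half of ViCO can be viewed as a KL term that pulls the token distribution computed from compressed visual tokens toward the distribution computed from full-resolution tokens. The minimal sketch below captures only that term; the actual recipe additionally samples over compression rates and uses a frozen reference model.

```python
import torch
import torch.nn.functional as F

def vico_consistency_loss(logits_full, logits_compressed):
    """KL(reference || student): match the response distribution obtained from
    compressed visual tokens to the one obtained from full-resolution tokens."""
    ref = F.log_softmax(logits_full.detach(), dim=-1)   # reference: full tokens
    student = F.log_softmax(logits_compressed, dim=-1)  # prediction: compressed tokens
    return F.kl_div(student, ref, log_target=True, reduction="batchmean")

# Toy example: batch of 2 sequences, 5 positions, 100-token vocabulary.
full = torch.randn(2, 5, 100)
compressed = torch.randn(2, 5, 100, requires_grad=True)
loss = vico_consistency_loss(full, compressed)
loss.backward()
print(f"consistency loss: {loss.item():.3f}")
```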
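Parallel thinking is essentially Best-of-N selection guided by a critic. The sketch below assumes two user-supplied callables, `generate_candidate` and `score_candidate`, as hypothetical stand-ins for sampling from the policy and querying a process reward model such as VisualPRM; it is not the released evaluation code.

```python
import random
from typing import Callable, List, Tuple

def best_of_n(question: str,
              generate_candidate: Callable[[str], str],
              score_candidate: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n reasoning chains and keep the one the critic scores highest."""
    candidates: List[str] = [generate_candidate(question) for _ in range(n)]
    scored = [(score_candidate(question, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda sc: sc[0])
    return best_answer, best_score

# Dummy policy and critic so the sketch runs end to end.
answer, score = best_of_n(
    "What is 17 * 24?",
    generate_candidate=lambda q: f"Let me think step by step... the answer is {random.choice([398, 408, 418])}.",
    score_candidate=lambda q, c: 1.0 if "408" in c else 0.0,
)
print(score, answer)
```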
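The pipelining benefit of DvD can be illustrated with a single-process asyncio queue: vision encoding of the next request overlaps with language decoding of the current one. In the real deployment the two sides run on separate servers and the visual features are transferred between them; `encode` and `generate` below are hypothetical placeholders, not the served APIs.

```python
import asyncio

async def vision_worker(requests, queue, encode):
    """Vision side: run ViT + MLP (+ ViR) and stream features to the LLM side."""
    loop = asyncio.get_running_loop()
    for req in requests:
        feats = await loop.run_in_executor(None, encode, req["image"])
        await queue.put((req["prompt"], feats))
    await queue.put(None)                      # end-of-stream marker

async def language_worker(queue, generate):
    """Language side: consume features as they arrive and decode responses."""
    loop = asyncio.get_running_loop()
    outputs = []
    while (item := await queue.get()) is not None:
        prompt, feats = item
        outputs.append(await loop.run_in_executor(None, generate, prompt, feats))
    return outputs

async def serve(requests, encode, generate):
    queue = asyncio.Queue(maxsize=2)           # bounded queue provides backpressure
    producer = asyncio.create_task(vision_worker(requests, queue, encode))
    results = await language_worker(queue, generate)
    await producer
    return results

# Dummy stand-ins so the sketch runs: encoding and decoding just echo their inputs.
reqs = [{"prompt": f"describe image {i}", "image": f"img{i}"} for i in range(3)]
outs = asyncio.run(serve(reqs, encode=lambda im: f"feat({im})",
                         generate=lambda p, f: f"{p} -> {f}"))
print(outs)
```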
Model Card
Name | InternVL3.5-1B | InternVL3.5-2B | InternVL3.5-4B | InternVL3.5-8B | InternVL3.5-14B | InternVL3.5-38B | InternVL3.5-20B-A4B | InternVL3.5-30B-A3B | InternVL3.5-241B-A28B |
---|---|---|---|---|---|---|---|---|---|
Total Parameters | 1.06B | 2.35B | 4.73B | 8.53B | 15.12B | 38.40B | 21.23B-A4B | 30.85B-A3B | 240.70B-A28B |
ViT Parameters | 304.01M | 304.01M | 304.01M | 304.01M | 304.01M | 5.54B | 304.01M | 304.01M | 5.54B |
MLP Parameters | 5.25M | 12.60M | 17.05M | 33.57M | 47.20M | 91.79M | 20.10M | 12.60M | 69.24M |
LLM Parameters | 751.63M | 2.03B | 4.41B | 8.19B | 14.77B | 32.76B | 20.91B | 30.53B | 235.09B |
Resolution | Dynamic resolution: up to 36 tiles of 448 × 448 during training, up to 128 tiles during testing (all models). |
Training Stage | Training Data | Trainable Modules |
---|---|---|
Pre-Training (CPT) | The pre-training corpora fall into two categories. (1) Multimodal data: mainly sourced from the training corpora of InternVL3, covering a diverse range of domains such as image captioning, general question answering, mathematics, scientific disciplines, charts, optical character recognition (OCR), knowledge grounding, document understanding, multi-turn dialogue, and medical data. (2) Text-only data: constructed from the training corpora of the InternLM series and further augmented with open-source datasets. The pre-training corpora contain approximately 116M samples, corresponding to about 250B tokens, with a text-only to multimodal ratio of roughly 1:2.5. The maximum sequence length is set to 32K tokens to accommodate long-context understanding and reasoning. | ViT + MLP + LLM |
SFT | The SFT datasets comprise approximately 56 million samples, corresponding to around 130 billion tokens, with a text-only to multimodal ratio of roughly 1:3.5. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) instruction-following data from InternVL3, reused to preserve broad coverage of vision-language tasks; (2) multimodal reasoning data in the "Thinking" mode, included to instill long-thinking capabilities in the model; and (3) capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation. | ViT + MLP + LLM |
Cascade RL | We use MMPR-v1.2 as the training data for offline RL, which contains about 200K sample pairs. Based on MMPR-v1.2, we compute the accuracy of each query using the provided rollouts and select those whose model accuracy falls between 0.2 and 0.8 for online RL. We further extend the dataset with recent multimodal datasets to enhance diversity. The resulting dataset, termed MMPR-Tiny, consists of approximately 70K queries. We directly reuse the rollouts from MMPR-v1.2 for both offline RL and data filtering in online RL, thereby reducing the cost of sampling additional rollouts (a minimal filtering sketch is given after the table). | ViT + MLP + LLM |
ViCO (Flash only) | During consistency training, we primarily use datasets identical to those of the SFT stage, ensuring that the model retains its original performance. During router training, we use a subset of the SFT data, primarily composed of OCR and VQA examples, which are rich in visual information and sometimes require high-resolution understanding. | ViT + MLP + LLM |
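As a small illustration of the online-RL data selection described in the Cascade RL row above, the snippet below keeps only queries whose rollout accuracy falls within the stated 0.2 to 0.8 band; the record format is an assumption made for the example, not the MMPR-Tiny schema.

```python
def filter_for_online_rl(records, low=0.2, high=0.8):
    """Keep queries that are neither trivially easy nor hopeless for the model.

    Each record is assumed to look like
    {"query": ..., "rollouts": [...], "correct": [...]} with one boolean per rollout.
    """
    kept = []
    for rec in records:
        accuracy = sum(rec["correct"]) / len(rec["correct"])
        if low <= accuracy <= high:
            kept.append(rec)
    return kept

# Toy example: only the middle query survives the filter.
data = [
    {"query": "easy", "rollouts": ["..."] * 4, "correct": [True, True, True, True]},
    {"query": "medium", "rollouts": ["..."] * 4, "correct": [True, False, True, False]},
    {"query": "hard", "rollouts": ["..."] * 4, "correct": [False, False, False, False]},
]
print([r["query"] for r in filter_for_online_rl(data)])  # ['medium']
```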
Performance
Citation
@article{wang2025internvl3,
title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
journal={arXiv preprint arXiv:2508.18265},
year={2025}
}