InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
[Go Back] [InternVL3.5 Report] [InternVL3 Report] [InternVL 2.5 Report] [InternVL 1.0 Paper] [InternVL 1.5 Paper] [Chat Demo] [GitHub] [Documents] [HF Demo] [ModelScope] [Quick Start]
| Type | Model | Date | HF Link | MS Link | Document |
|---|---|---|---|---|---|
| Multimodal Large Language Models (GitHub Format) | InternVL3_5-1B | 2025.08.26 | link | link | doc |
| | InternVL3_5-2B | 2025.08.26 | link | link | doc |
| | InternVL3_5-4B | 2025.08.26 | link | link | doc |
| | InternVL3_5-8B | 2025.08.26 | link | link | doc |
| | InternVL3_5-14B | 2025.08.26 | link | link | doc |
| | InternVL3_5-38B | 2025.08.26 | link | link | doc |
| | InternVL3_5-20B-A4B | 2025.08.26 | link | link | doc |
| | InternVL3_5-30B-A3B | 2025.08.26 | link | link | doc |
| | InternVL3_5-241B-A28B | 2025.08.26 | link | link | doc |
| Multimodal Large Language Models (HuggingFace Format) | InternVL3_5-1B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-2B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-4B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-8B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-14B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-38B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-20B-A4B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-30B-A3B-HF | 2025.08.26 | link | link | doc |
| | InternVL3_5-241B-A28B-HF | 2025.08.26 | link | link | doc |
We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models such as GPT-5. All models and code are publicly released.
Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and two-stage Cascade Reinforcement Learning (Cascade RL). In Cascade RL, we first fine-tune the model with Mixed Preference Optimization (MPO) in an offline RL setting, followed by GSPO in an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the number of tokens required to represent an image patch.
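To make the cascade concrete, below is a minimal PyTorch sketch of how the two RL stages are sequenced, assuming a DPO-style preference term as a stand-in for the offline MPO objective and a clipped policy-gradient surrogate as a stand-in for online GSPO; all function names, weights, and toy tensors are illustrative and are not taken from the released training code.

```python
import torch
import torch.nn.functional as F

def offline_preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO-style preference term, used here as a stand-in for the offline MPO stage
    # (the full MPO objective also mixes quality and generation terms).
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

def online_clipped_loss(log_ratio, advantages, clip_eps=0.2):
    # Clipped policy-gradient surrogate, standing in for the online GSPO stage
    # (GSPO defines the importance ratio at the sequence level).
    ratio = log_ratio.exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy tensors just to show the two stages running in order (offline -> online).
chosen, rejected = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
stage1 = offline_preference_loss(chosen, rejected, ref_c, ref_r)

log_ratio, advantages = torch.randn(8), torch.randn(8)
stage2 = online_clipped_loss(log_ratio, advantages)
print(f"offline (MPO-style) loss {stage1.item():.3f} -> online (GSPO-style) loss {stage2.item():.3f}")
```

The point of the coarse-to-fine ordering is that the offline stage gives a stable warm-up from pre-collected preference pairs, so the online stage can spend its rollout budget on finer-grained alignment.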
We also open-source the model weights from different training stages for research use. If you are unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.
Model | Training Pipeline | HF Link | MS Link |
---|---|---|---|
InternVL3_5-1B-Pretrained | CPT | link | link |
InternVL3_5-1B-Instruct | CPT + SFT | link | link |
InternVL3_5-1B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-1B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-2B-Pretrained | CPT | link | link |
InternVL3_5-2B-Instruct | CPT + SFT | link | link |
InternVL3_5-2B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-2B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-4B-Pretrained | CPT | link | link |
InternVL3_5-4B-Instruct | CPT + SFT | link | link |
InternVL3_5-4B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-4B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-8B-Pretrained | CPT | link | link |
InternVL3_5-8B-Instruct | CPT + SFT | link | link |
InternVL3_5-8B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-8B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-14B-Pretrained | CPT | link | link |
InternVL3_5-14B-Instruct | CPT + SFT | link | link |
InternVL3_5-14B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-14B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-30B-A3B-Pretrained | CPT | link | link |
InternVL3_5-30B-A3B-Instruct | CPT + SFT | link | link |
InternVL3_5-30B-A3B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-30B-A3B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-38B-Pretrained | CPT | link | link |
InternVL3_5-38B-Instruct | CPT + SFT | link | link |
InternVL3_5-38B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-38B | CPT + SFT + Cascade RL | link | link |
InternVL3_5-241B-A28B-Pretrained | CPT | link | link |
InternVL3_5-241B-A28B-Instruct | CPT + SFT | link | link |
InternVL3_5-241B-A28B-MPO | CPT + SFT + MPO | link | link |
InternVL3_5-241B-A28B | CPT + SFT + Cascade RL | link | link |
The Flash version of our model will be released as soon as possible.
The InternVL3.5 family is built upon the following designs:
- Dynamic High-Resolution Architecture: We follow the ViT-MLP-LLM paradigm of earlier InternVL models. The language model is initialized from Qwen3 or GPT-OSS (depending on the variant), while the vision encoder adopts InternViT-300M or InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL 1.5 is retained to improve image understanding at varying resolutions (a simplified tiling sketch is given after this list).
- Visual Resolution Router: InternVL3.5-Flash incorporates the Visual Resolution Router (ViR) to enable adaptive compression of visual tokens. Each image patch is routed to an appropriate compression rate according to its semantic richness. This patch-aware routing reduces the number of visual tokens by up to 50% while maintaining nearly full performance, making the Flash variants highly efficient in resource-constrained scenarios (a router sketch is given after this list).
- Progressive Training with Cascade RL: Training integrates pre-training, supervised fine-tuning, and a cascade reinforcement learning paradigm. Offline RL warm-up with Mixed Preference Optimization (MPO) is followed by online RL with GSPO, striking a balance between training efficiency and quality of reasoning.
- Visual Consistency Learning (ViCO): We introduce ViCO to align responses across different compression rates of visual tokens, enabling efficient variants such as InternVL3.5-Flash. A two-stage process, consistency training with a KL-divergence objective followed by router training, ensures robustness under varying token resolutions (a consistency-loss sketch is given after this list).
- Enhanced Test-Time Scaling: InternVL3.5 adopts a comprehensive test-time scaling strategy that combines deep thinking (step-by-step reasoning in Thinking mode) and parallel thinking (Best-of-N selection with VisualPRM-v1.1 as the critic). This improves both the depth and breadth of reasoning (a Best-of-N sketch is given after this list).
- Decoupled Vision-Language Deployment: To reduce inference cost and avoid blocking between vision and language computation, InternVL3.5 introduces Decoupled Vision-Language Deployment (DvD). The vision modules (ViT, MLP, and ViR) run on dedicated vision servers, while the language model runs on separate language servers. This design enables asynchronous pipelining, higher throughput, and more flexible system optimization (a pipelining sketch is given after this list).
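The Dynamic High Resolution strategy mentioned above boils down to cutting each image into a grid of 448 × 448 tiles whose shape best matches the input's aspect ratio. Below is a simplified sketch of that idea, assuming Pillow for image handling; the actual preprocessing in the InternVL codebase uses additional tie-breaking rules, so treat this as an approximation rather than the shipped implementation.

```python
from PIL import Image

def pick_tile_grid(width, height, min_tiles=1, max_tiles=36):
    """Choose the (cols, rows) grid whose aspect ratio best matches the image."""
    aspect = width / height
    grids = [(c, r) for r in range(1, max_tiles + 1)
             for c in range(1, max_tiles + 1)
             if min_tiles <= c * r <= max_tiles]
    return min(grids, key=lambda g: abs(g[0] / g[1] - aspect))

def tile_image(img, tile=448, max_tiles=36, use_thumbnail=True):
    """Resize the image onto the chosen grid and cut it into 448x448 tiles."""
    cols, rows = pick_tile_grid(img.width, img.height, max_tiles=max_tiles)
    resized = img.resize((cols * tile, rows * tile))
    tiles = [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]
    if use_thumbnail and len(tiles) > 1:
        tiles.append(img.resize((tile, tile)))  # extra global-overview tile
    return tiles

# Example: a 16:9 input is covered by a wide grid plus one thumbnail tile.
print(len(tile_image(Image.new("RGB", (1920, 1080)))), "tiles")
```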
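The patch-aware routing of ViR can be pictured as a tiny classifier that scores each tile and merges the tokens of tiles judged compressible. The sketch below is a hypothetical illustration: the router architecture, the pooling-based merge, and the 4× factor are assumptions made for exposition, not the released ViR implementation.

```python
import torch
import torch.nn as nn

class TileRouter(nn.Module):
    """Toy router: decide per tile whether its visual tokens can be compressed."""
    def __init__(self, hidden=64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden, hidden // 2),
            nn.GELU(),
            nn.Linear(hidden // 2, 2),  # class 0 = keep, class 1 = compress
        )

    def forward(self, tile_tokens):              # (num_tiles, tokens, hidden)
        pooled = tile_tokens.mean(dim=1)         # one descriptor per tile
        return self.scorer(pooled).argmax(-1)    # routing decision per tile

def merge_tokens(tile_tokens, decisions, factor=4):
    """Stand-in compression: average groups of `factor` tokens on compressible tiles."""
    out = []
    for tokens, keep_full in zip(tile_tokens, decisions == 0):
        if keep_full:
            out.append(tokens)
        else:
            t, h = tokens.shape
            out.append(tokens.view(t // factor, factor, h).mean(dim=1))
    return out

tiles = torch.randn(6, 256, 64)                  # 6 tiles, 256 visual tokens each
decisions = TileRouter()(tiles)
compressed = merge_tokens(tiles, decisions)
print([t.shape[0] for t in compressed])          # mixture of 256- and 64-token tiles
```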
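The consistency-training half of ViCO can be viewed as a KL term that pulls the token distribution computed from compressed visual tokens toward the distribution computed from full-resolution tokens. The minimal sketch below captures only that term; the actual recipe additionally samples over compression rates and uses a frozen reference model.

```python
import torch
import torch.nn.functional as F

def vico_consistency_loss(logits_full, logits_compressed):
    """KL(reference || student): match the response distribution obtained from
    compressed visual tokens to the one obtained from full-resolution tokens."""
    ref = F.log_softmax(logits_full.detach(), dim=-1)   # reference: full tokens
    student = F.log_softmax(logits_compressed, dim=-1)  # prediction: compressed tokens
    return F.kl_div(student, ref, log_target=True, reduction="batchmean")

# Toy example: batch of 2 sequences, 5 positions, 100-token vocabulary.
full = torch.randn(2, 5, 100)
compressed = torch.randn(2, 5, 100, requires_grad=True)
loss = vico_consistency_loss(full, compressed)
loss.backward()
print(f"consistency loss: {loss.item():.3f}")
```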
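Parallel thinking is essentially Best-of-N selection guided by a critic. The sketch below assumes two user-supplied callables, `generate_candidate` and `score_candidate`, as hypothetical stand-ins for sampling from the policy and querying a process reward model such as VisualPRM; it is not the released evaluation code.

```python
import random
from typing import Callable, List, Tuple

def best_of_n(question: str,
              generate_candidate: Callable[[str], str],
              score_candidate: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n reasoning chains and keep the one the critic scores highest."""
    candidates: List[str] = [generate_candidate(question) for _ in range(n)]
    scored = [(score_candidate(question, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda sc: sc[0])
    return best_answer, best_score

# Dummy policy and critic so the sketch runs end to end.
answer, score = best_of_n(
    "What is 17 * 24?",
    generate_candidate=lambda q: f"Let me think step by step... the answer is {random.choice([398, 408, 418])}.",
    score_candidate=lambda q, c: 1.0 if "408" in c else 0.0,
)
print(score, answer)
```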
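The pipelining benefit of DvD can be illustrated with a single-process asyncio queue: vision encoding of the next request overlaps with language decoding of the current one. In the real deployment the two sides run on separate servers and the visual features are transferred between them; `encode` and `generate` below are hypothetical placeholders, not the served APIs.

```python
import asyncio

async def vision_worker(requests, queue, encode):
    """Vision side: run ViT + MLP (+ ViR) and stream features to the LLM side."""
    loop = asyncio.get_running_loop()
    for req in requests:
        feats = await loop.run_in_executor(None, encode, req["image"])
        await queue.put((req["prompt"], feats))
    await queue.put(None)                      # end-of-stream marker

async def language_worker(queue, generate):
    """Language side: consume features as they arrive and decode responses."""
    loop = asyncio.get_running_loop()
    outputs = []
    while (item := await queue.get()) is not None:
        prompt, feats = item
        outputs.append(await loop.run_in_executor(None, generate, prompt, feats))
    return outputs

async def serve(requests, encode, generate):
    queue = asyncio.Queue(maxsize=2)           # bounded queue provides backpressure
    producer = asyncio.create_task(vision_worker(requests, queue, encode))
    results = await language_worker(queue, generate)
    await producer
    return results

# Dummy stand-ins so the sketch runs: encoding and decoding just echo their inputs.
reqs = [{"prompt": f"describe image {i}", "image": f"img{i}"} for i in range(3)]
outs = asyncio.run(serve(reqs, encode=lambda im: f"feat({im})",
                         generate=lambda p, f: f"{p} -> {f}"))
print(outs)
```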
Model Card
Name | InternVL3.5-1B | InternVL3.5-2B | InternVL3.5-4B | InternVL3.5-8B | InternVL3.5-14B | InternVL3.5-38B | InternVL3.5-20B-A4B | InternVL3.5-30B-A3B | InternVL3.5-241B-A28B |
---|---|---|---|---|---|---|---|---|---|
Total Parameters | 1.06B | 2.35B | 4.73B | 8.53B | 15.12B | 38.40B | 21.23B-A4B | 30.85B-A3B | 240.70B-A28B |
ViT Parameters | 304.01M | 304.01M | 304.01M | 304.01M | 304.01M | 5.54B | 304.01M | 304.01M | 5.54B |
MLP Parameters | 5.25M | 12.60M | 17.05M | 33.57M | 47.20M | 91.79M | 20.10M | 12.60M | 69.24M |
LLM Parameters | 751.63M | 2.03B | 4.41B | 8.19B | 14.77B | 32.76B | 20.91B | 30.53B | 235.09B |
Resolution | Dynamic resolution: up to 36 tiles of 448 × 448 during training, up to 128 tiles during testing (all models). |
Training Stage | Training Data | Trainable Modules |
---|---|---|
Pre-Training (CPT) | The pre-training corpora fall into two categories. (1) Multimodal data: mainly sourced from the training corpora of InternVL3, covering a diverse range of domains such as image captioning, general question answering, mathematics, scientific disciplines, charts, optical character recognition (OCR), knowledge grounding, document understanding, multi-turn dialogue, and medical data. (2) Text-only data: constructed from the training corpora of the InternLM series and further augmented with open-source datasets. The pre-training corpora contain approximately 116M samples, corresponding to about 250B tokens, with a text-only to multimodal ratio of roughly 1:2.5. The maximum sequence length is set to 32K tokens to accommodate long-context understanding and reasoning. | ViT + MLP + LLM |
SFT | The SFT datasets comprise approximately 56 million samples, corresponding to around 130 billion tokens, with a text-only to multimodal ratio of roughly 1:3.5. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) instruction-following data from InternVL3, reused to preserve broad coverage of vision-language tasks; (2) multimodal reasoning data in the "Thinking" mode, included to instill long-thinking capabilities in the model; and (3) capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation. | ViT + MLP + LLM |
Cascade RL | We use MMPR-v1.2 as the training data for offline RL, which contains about 200K sample pairs. Based on MMPR-v1.2, we compute the accuracy of each query using the provided rollouts and select those whose model accuracy falls between 0.2 and 0.8 for online RL. We further extend the dataset with recent multimodal datasets to enhance diversity. The resulting dataset, termed MMPR-Tiny, consists of approximately 70K queries. We directly reuse the rollouts from MMPR-v1.2 for both offline RL and data filtering in online RL, thereby reducing the cost of sampling additional rollouts (a minimal filtering sketch is given after the table). | ViT + MLP + LLM |
ViCO (Flash only) | During consistency training, we primarily use datasets identical to those of the SFT stage, ensuring that the model retains its original performance. During router training, we use a subset of the SFT data, primarily composed of OCR and VQA examples, which are rich in visual information and sometimes require high-resolution understanding. | ViT + MLP + LLM |
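As a small illustration of the online-RL data selection described in the Cascade RL row above, the snippet below keeps only queries whose rollout accuracy falls within the stated 0.2 to 0.8 band; the record format is an assumption made for the example, not the MMPR-Tiny schema.

```python
def filter_for_online_rl(records, low=0.2, high=0.8):
    """Keep queries that are neither trivially easy nor hopeless for the model.

    Each record is assumed to look like
    {"query": ..., "rollouts": [...], "correct": [...]} with one boolean per rollout.
    """
    kept = []
    for rec in records:
        accuracy = sum(rec["correct"]) / len(rec["correct"])
        if low <= accuracy <= high:
            kept.append(rec)
    return kept

# Toy example: only the middle query survives the filter.
data = [
    {"query": "easy", "rollouts": ["..."] * 4, "correct": [True, True, True, True]},
    {"query": "medium", "rollouts": ["..."] * 4, "correct": [True, False, True, False]},
    {"query": "hard", "rollouts": ["..."] * 4, "correct": [False, False, False, False]},
]
print([r["query"] for r in filter_for_online_rl(data)])  # ['medium']
```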
Performance
Citation
@article{wang2025internvl3,
title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
journal={arXiv preprint arXiv:2508.18265},
year={2025}
}