We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances the InternVL series in versatility, reasoning capability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models such as GPT-5. All models and code are publicly released.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (Cascade RL). In Cascade RL, we first fine-tune the model with Mixed Preference Optimization (MPO) in an offline RL setting, followed by GSPO in an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the number of tokens required to represent an image patch.
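
To make the coarse-to-fine recipe concrete, the sketch below illustrates the two objectives in simplified form: an MPO-style offline loss mixing preference, quality, and generation terms, followed by a GSPO-style clipped objective on length-normalized sequence-level importance ratios. All function names, loss weights, and tensor shapes are illustrative assumptions rather than the actual training code.

```python
# Simplified sketch of the Cascade RL objectives; hypothetical names and weights,
# operating on pre-computed sequence log-probabilities rather than real rollouts.
import torch
import torch.nn.functional as F

def mpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, sft_nll,
             beta=0.1, w_pref=0.8, w_qual=0.2, w_gen=1.0):
    """Offline stage: Mixed Preference Optimization = preference (DPO-style)
    + quality (BCO-style) + generation (SFT) terms on chosen/rejected pairs."""
    # Relative preference: chosen should beat rejected under the policy vs. the reference.
    pref = -F.logsigmoid(beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))).mean()
    # Absolute quality: push chosen above and rejected below the reference model.
    qual = 0.5 * (-F.logsigmoid(beta * (logp_c - ref_logp_c)).mean()
                  - F.logsigmoid(-beta * (logp_r - ref_logp_r)).mean())
    # Generation: plain negative log-likelihood on the chosen responses.
    gen = sft_nll.mean()
    return w_pref * pref + w_qual * qual + w_gen * gen

def gspo_loss(logp_new, logp_old, advantages, seq_lens, clip_eps=0.2):
    """Online stage: GSPO-style clipped objective on length-normalized
    sequence-level importance ratios."""
    ratio = torch.exp((logp_new - logp_old) / seq_lens)  # one ratio per sequence
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Cascade: offline MPO warm-up first, then online GSPO refinement.
b = 4  # dummy batch of 4 preference pairs / rollout sequences
offline = mpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b),
                   sft_nll=torch.rand(b))
online = gspo_loss(torch.randn(b), torch.randn(b),
                   advantages=torch.randn(b), seq_lens=torch.full((b,), 64.0))
print(offline.item(), online.item())
```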

Here, we also open-source the model weights from different training stages for research use. If you are unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | MS Link |
| :--- | :--- | :--- | :--- |
| InternVL3_5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-1B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
| InternVL3_5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-2B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
| InternVL3_5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-4B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
| InternVL3_5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-8B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
| InternVL3_5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-14B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
| InternVL3_5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-30B-A3B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
| InternVL3_5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-38B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
| InternVL3_5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3_5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3_5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3_5-241B-A28B | CPT + SFT + Cascade RL | 🤗 link | 🤖 link |
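
For reference, a minimal loading sketch with 🤗 Transformers is shown below. It follows the pattern of earlier InternVL model cards; the example repository id and the remote-code `chat` interface are assumptions, so please consult the individual model cards for the authoritative quick-start code.

```python
# Minimal loading sketch following the pattern of earlier InternVL model cards.
# The repository id and the remote-code `chat` interface are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3_5-8B"  # hypothetical example checkpoint
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # loads the InternVL modeling code shipped with the checkpoint
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query via the custom `chat` helper exposed by the remote code
# (signature assumed to follow earlier InternVL releases).
generation_config = dict(max_new_tokens=1024, do_sample=True)
question = "Hello, who are you?"
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```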

The Flash version of our model will be released as soon as possible.

The InternVL3.5 family is built upon the following designs:

  1. Dynamic High-Resolution Architecture: We follow the ViT-MLP-LLM paradigm of earlier InternVL models. The language model is initialized from Qwen3 and GPT-OSS, while the vision encoder adopts InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy, introduced in InternVL1.5, is also retained to improve image understanding at varying resolutions.
  2. Visual Resolution Router: InternVL3.5-Flash incorporates the Visual Resolution Router (ViR) to enable adaptive compression of visual tokens. Each image patch is routed to an appropriate compression rate depending on its semantic richness. This patch-aware routing reduces the number of tokens by up to 50% while maintaining nearly full performance, making the Flash variants highly efficient in resource-constrained scenarios.
  3. Progressive Training with Cascade RL: Training integrates pre-training, supervised fine-tuning, and a cascade reinforcement learning paradigm. An offline RL warm-up with Mixed Preference Optimization (MPO) is followed by online RL with GSPO, striking a balance between training efficiency and reasoning quality.
  4. Visual Consistency Learning (ViCO): We introduce ViCO to align responses across different compression rates of visual tokens, enabling efficient variants such as InternVL3.5-Flash. A two-stage process, consistency training with KL divergence followed by router training, ensures robustness under varying token resolutions (a minimal sketch of ViR routing and this objective follows this list).
  5. Enhanced Test-Time Scaling: InternVL3.5 adopts a comprehensive test-time scaling strategy, combining deep thinking (step-by-step reasoning in Thinking mode) and parallel thinking (Best-of-N selection with VisualPRM-v1.1 as the critic). This improves the logical depth and breadth of reasoning.
  6. Decoupled Vision-Language Deployment: To reduce inference cost and blocking between vision and language computation, InternVL3.5 introduces Decoupled Vision-Language Deployment (DvD). Vision modules (ViT, MLP, ViR) run independently on vision servers, while the language model runs on language servers. This design enables asynchronous pipelining, better throughput, and flexible system-level optimization.
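
To make items 2 and 4 more concrete, below is a minimal sketch of patch-aware routing and the ViCO consistency objective. The module names, feature shapes, candidate compression rates, and loss form are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of ViR routing and a ViCO-style consistency objective.
# Module names, shapes, and candidate compression rates are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualResolutionRouter(nn.Module):
    """Scores each image patch and selects a token compression rate based on its content."""
    def __init__(self, dim, rates=(1, 4)):
        super().__init__()
        self.rates = rates                        # candidate compression factors per patch
        self.scorer = nn.Linear(dim, len(rates))  # lightweight routing head

    def forward(self, patch_feats):               # patch_feats: (num_patches, tokens_per_patch, dim)
        logits = self.scorer(patch_feats.mean(dim=1))  # one routing decision per patch
        choice = logits.argmax(dim=-1)
        return [self.rates[i] for i in choice.tolist()]

def vico_consistency_loss(logits_full, logits_compressed):
    """KL divergence aligning the response distribution conditioned on compressed visual
    tokens with the (frozen) distribution conditioned on full-resolution tokens."""
    log_p_full = F.log_softmax(logits_full.detach(), dim=-1)  # teacher: full-resolution tokens
    log_p_comp = F.log_softmax(logits_compressed, dim=-1)     # student: compressed tokens
    return F.kl_div(log_p_comp, log_p_full, log_target=True, reduction="batchmean")

# Example: route 12 patches of 256 tokens each, then compare next-token distributions.
router = VisualResolutionRouter(dim=1024)
rates = router(torch.randn(12, 256, 1024))
loss = vico_consistency_loss(torch.randn(8, 32000), torch.randn(8, 32000))
print(rates, loss.item())
```

In the Flash variants, routing a patch to the higher compression rate is what cuts its token count, while the KL term keeps the response distribution close to the full-resolution reference so the saving comes at little accuracy cost.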

Model Card

| Name | 1B | 2B | 4B | 8B | 14B | 38B | 20B-A4B | 30B-A3B | 241B-A28B |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Model Size (Total) | 1.06B | 2.35B | 4.73B | 8.53B | 15.12B | 38.40B | 21.23B-A4B | 30.85B-A3B | 240.70B-A28B |
| ViT | 304.01M | 304.01M | 304.01M | 304.01M | 304.01M | 5.54B | 304.01M | 304.01M | 5.54B |
| MLP | 5.25M | 12.60M | 17.05M | 33.57M | 47.20M | 91.79M | 20.10M | 12.60M | 69.24M |
| LLM | 751.63M | 2.03B | 4.41B | 8.19B | 14.77B | 32.76B | 20.91B | 30.53B | 235.09B |
Resolution: dynamic resolution, up to 36 tiles of 448 × 448 during training and up to 128 tiles during testing.
CPT Training Data: The pre-training corpus falls into two categories. (1) Multimodal data: mainly sourced from the training corpora of InternVL3, covering a diverse range of domains such as image captioning, general question answering, mathematics, scientific disciplines, charts, optical character recognition (OCR), knowledge grounding, document understanding, multi-turn dialogue, and medical data. (2) Text-only data: constructed from the training corpora of the InternLM series and further augmented with open-source datasets. The corpus contains approximately 116M samples, corresponding to about 250B tokens, with a text-only to multimodal data ratio of roughly 1:2.5. The maximum sequence length is set to 32K tokens to support long-context understanding and reasoning.
CPT Trainable Modules: ViT + MLP + LLM
SFT Training Data: The SFT datasets comprise approximately 56 million samples, corresponding to around 130 billion tokens, with a text-only to multimodal data ratio of roughly 1:3.5. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) instruction-following data from InternVL3, reused to preserve broad coverage of vision-language tasks; (2) multimodal reasoning data in the "Thinking" mode, included to instill long-thinking capabilities; and (3) capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation.
SFT Trainable Modules: ViT + MLP + LLM
Cascade RL Training Data: We use MMPR-v1.2, which contains about 200K sample pairs, as the training data for offline RL. Based on MMPR-v1.2, we compute the accuracy of each query from the provided rollouts and select those whose accuracy falls between 0.2 and 0.8 for online RL. We further extend the dataset with recent multimodal datasets to enhance diversity. The resulting dataset, termed MMPR-Tiny, consists of approximately 70K queries. We directly reuse the rollouts from MMPR-v1.2 for both offline RL and data filtering in online RL, thereby reducing the cost of sampling additional rollouts.
Cascade RL Trainable Modules: ViT + MLP + LLM
ViCO Training Data: During consistency training, we primarily use the same datasets as in the SFT stage to ensure that the model retains its original performance. During router training, we use a subset of the SFT data composed mainly of OCR and VQA examples, which are rich in visual information and sometimes require high-resolution understanding.
ViCO Trainable Modules: ViT + MLP + LLM
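
For illustration, below is a minimal sketch of the accuracy-based query filtering used to build the online RL set described above; the dictionary schema and helper name are hypothetical assumptions, since the release does not prescribe a specific data format.

```python
# Hypothetical schema: each query dict carries pre-sampled rollouts with a boolean
# "correct" field (e.g., reused from MMPR-v1.2), so no extra sampling is needed.
def filter_queries_for_online_rl(queries, lower=0.2, upper=0.8):
    """Keep queries whose rollout accuracy lies between `lower` and `upper`,
    i.e., queries that are neither trivially easy nor hopelessly hard."""
    selected = []
    for q in queries:
        rollouts = q["rollouts"]
        accuracy = sum(r["correct"] for r in rollouts) / max(len(rollouts), 1)
        if lower <= accuracy <= upper:
            selected.append(q)
    return selected

# Example usage with two toy queries.
toy = [
    {"id": 0, "rollouts": [{"correct": True}, {"correct": False}]},  # accuracy 0.5 -> kept
    {"id": 1, "rollouts": [{"correct": True}, {"correct": True}]},   # accuracy 1.0 -> dropped
]
print([q["id"] for q in filter_queries_for_online_rl(toy)])
```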

Performance


Please see our paper for more experimental results.

Citation


  @article{wang2025internvl3,
    title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
    author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
    journal={arXiv preprint arXiv:2508.18265},
    year={2025}
  }

