| Type | Model | Date | Download | Note |
| :--- | :--- | :--- | :--- | :--- |
| Vision-Language Foundation Model | InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| Vision-Language Foundation Model | InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP |
| Vision Large Language Model | InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| Vision Large Language Model | InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| Vision Large Language Model | InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
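For the retrieval use case noted above, a rough loading sketch with 🤗 Transformers follows. The model ID comes from the table; the `encode_image`/`encode_text` calls are illustrative placeholders, so check the model card on Hugging Face for the exact remote-code interface.

```python
# Hedged sketch: load InternVL-14B-224px for CLIP-style image-text retrieval.
# encode_image/encode_text are illustrative placeholders for the model card's
# remote-code interface; the exact method names may differ.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

model_id = "OpenGVLab/InternVL-14B-224px"
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval()
processor = CLIPImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("cat.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch.bfloat16)
text_inputs = tokenizer(["a photo of a cat", "a photo of a dog"],
                        return_tensors="pt", padding=True)

with torch.no_grad():
    image_feat = model.encode_image(pixel_values)         # placeholder call
    text_feat = model.encode_text(text_inputs.input_ids)  # placeholder call
    scores = (image_feat @ text_feat.t()).softmax(dim=-1)
print(scores)
```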
## What is InternVL?

We released InternVL, which scales up the ViT to 6B parameters and aligns it with LLMs. It is the largest open-source vision/vision-language foundation model (14B) to date, achieving state-of-the-art performance on 32 benchmarks spanning tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.
## How is InternVL trained?

The training strategy of InternVL consists of three progressive stages: vision-language contrastive training, vision-language generative training, and supervised fine-tuning. These stages effectively leverage public data from diverse sources, ranging from noisy web image-text pairs to high-quality caption, VQA, and multimodal dialogue datasets.
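For intuition about the first stage, a generic CLIP-style symmetric InfoNCE loss is sketched below; this illustrates the objective only and is not the project's training code.

```python
# Minimal sketch of vision-language contrastive training (CLIP-style InfoNCE).
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature        # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)  # matches on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```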
## What can InternVL do?

InternVL is a "Swiss Army Knife" model. By flexibly combining the vision encoder and the language middleware (QLLaMA), InternVL can support various vision and vision-language tasks, including:
### Visual Perception
- Linear-Probe Image Classification
\* ViT-22B uses the private JFT-3B dataset.

| Method | #Param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| InternViT-6B | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
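As a reference point for the protocol, a linear probe trains only a single linear classifier on frozen backbone features. A minimal sketch follows; the backbone call and `feature_dim` are placeholders, not the repo's evaluation script.

```python
# Linear-probe sketch: freeze the backbone, train only a linear head.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, backbone, feature_dim, num_classes=1000):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False        # features stay frozen
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)  # assumed (B, feature_dim) pooled features
        return self.head(feats)
```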
- Semantic Segmentation
| Method | Decoder | #Param (Train / Total) | Crop Size | mIoU |
| :--- | :--- | :--- | ---: | ---: |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
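In the frozen rows with a Linear decoder, only a pointwise classifier on top of frozen features is trained. Below is a minimal sketch of such a head, assuming a backbone that returns a (B, C, h, w) feature map; it mirrors the protocol, not the repo's mmsegmentation configs.

```python
# Linear segmentation head sketch: a 1x1 conv maps frozen features to
# per-class logits, then upsamples to the crop size.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegHead(nn.Module):
    def __init__(self, in_channels, num_classes=150):  # 150 = ADE20K classes
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats, out_size):
        logits = self.classifier(feats)                # (B, num_classes, h, w)
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)
```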
- Zero-Shot Image Classification
| Method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| InternVL-C | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
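These zero-shot numbers come from the standard CLIP-style protocol: each class name is wrapped in a prompt, encoded once as text, and an image is assigned to the class with the highest cosine similarity. A minimal sketch follows, where `encode_image`/`encode_text` are placeholders for the model's embedding calls, not a documented API.

```python
# CLIP-style zero-shot classification sketch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, pixel_values, class_names):
    # Build one prompt per class and embed all prompts in a single batch.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_ids = tokenizer(prompts, return_tensors="pt", padding=True).input_ids
    text_feats = F.normalize(model.encode_text(text_ids), dim=-1)        # (K, D)
    image_feats = F.normalize(model.encode_image(pixel_values), dim=-1)  # (B, D)
    # Cosine similarity; the highest-scoring prompt gives the predicted class.
    return (image_feats @ text_feats.t()).argmax(dim=-1)
```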
- Multilingual Zero-Shot Image Classification
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

| Method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
- Zero-Shot Video Classification
| Method | #Frame | K400 | K600 | K700 |
| :--- | ---: | ---: | ---: | ---: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C | 8 | 79.4 | 78.8 | 71.5 |
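The 8-frame rows evaluate the same image-text model on videos. A common recipe, sketched below, encodes each sampled frame and averages the embeddings into one video representation before matching against label prompts; `encode_image` is a placeholder, and the paper's exact pooling may differ.

```python
# Frame-averaging sketch for zero-shot video classification.
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_video(model, frames):               # frames: (T, 3, H, W), e.g. T = 8
    frame_feats = model.encode_image(frames)   # (T, D) per-frame embeddings
    video_feat = F.normalize(frame_feats, dim=-1).mean(dim=0)
    return F.normalize(video_feat, dim=-1)     # (D,) unit-norm video embedding
```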
### Cross-Modal Retrieval
- English Zero-Shot Image-Text Retrieval
I→T: image-to-text retrieval; T→I: text-to-image retrieval.

| Model | Flickr30K I→T R@1 | R@5 | R@10 | Flickr30K T→I R@1 | R@5 | R@10 | COCO I→T R@1 | R@5 | R@10 | COCO T→I R@1 | R@5 | R@10 | Average |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| OpenCLIP-G | 92.9 | 99.3 | 99.8 | 79.5 | 95.0 | 97.1 | 67.3 | 86.9 | 92.6 | 51.4 | 74.9 | 83.0 | 85.0 |
| EVA-02-CLIP-E+ | 93.9 | 99.4 | 99.8 | 78.8 | 94.2 | 96.8 | 68.8 | 87.8 | 92.8 | 51.1 | 75.0 | 82.7 | 85.1 |
| EVA-CLIP-8B | 95.6 | 99.6 | 99.9 | 80.8 | 95.5 | 97.6 | 70.3 | 89.3 | 93.9 | 53.0 | 76.0 | 83.4 | 86.2 |
| InternVL-C | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | 86.6 |
| InternVL-G | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | 88.8 |
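For reference, R@K in these tables is the fraction of queries whose ground-truth match ranks in the top K retrieved items. A minimal sketch over unit-normalized paired embeddings follows; it is illustrative, not the repo's evaluation code.

```python
# Recall@K sketch for image-to-text retrieval: row i of the similarity
# matrix scores image i against every text; the ground-truth text shares
# the same index. Text-to-image works the same way on the transpose.
import torch

@torch.no_grad()
def recall_at_k(image_feats, text_feats, k=1):
    sims = image_feats @ text_feats.t()      # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices      # top-K text indices per image
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1)     # did the true match rank in top K?
    return hits.float().mean().item()
```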
- Chinese Zero-Shot Image-Text Retrieval
| Model | Flickr30K-CN I→T R@1 | R@5 | R@10 | Flickr30K-CN T→I R@1 | R@5 | R@10 | COCO-CN I→T R@1 | R@5 | R@10 | COCO-CN T→I R@1 | R@5 | R@10 | Average |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| CN-CLIP-ViT-H | 81.6 | 97.5 | 98.8 | 71.2 | 91.4 | 95.5 | 63.0 | 86.6 | 92.9 | 69.2 | 89.9 | 96.1 | 86.1 |
| OpenCLIP-XLM-R-H | 86.1 | 97.5 | 99.2 | 71.0 | 90.5 | 94.9 | 70.0 | 91.5 | 97.0 | 66.1 | 90.8 | 96.0 | 87.6 |
| InternVL-C | 90.3 | 98.8 | 99.7 | 75.1 | 92.9 | 96.4 | 68.8 | 92.0 | 96.7 | 68.9 | 91.9 | 96.5 | 89.0 |
| InternVL-G | 92.9 | 99.4 | 99.8 | 77.7 | 94.8 | 97.3 | 71.4 | 93.9 | 97.7 | 73.8 | 94.4 | 98.1 | 90.9 |
- Multilingual Zero-Shot Image-Text Retrieval on XTD
| Method | EN | ES | FR | ZH | IT | KO | RU | JP | Average |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
### Multimodal Dialogue
- Zero-Shot Image Captioning
Scores are CIDEr.

| Method | COCO | Flickr30K | NoCaps |
| :--- | ---: | ---: | ---: |
| Emu-I | 117.7 | - | - |
| DreamLLM | 115.4 | - | - |
| InternVL-G | 128.2 | 79.2 | 113.7 |
- Multimodal Benchmarks with Frozen LLM
| Method | Vision Encoder | Glue Layer | LLM | Res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE |
| :--- | :--- | :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| InstructBLIP | EVA-g | QFormer | Vicuna-7B | 224 | - | 82.4 | 123.1 | - | 49.2 | 34.5 | 50.1 | - | - |
| BLIP-2 | EVA-g | QFormer | Vicuna-13B | 224 | - | 71.6 | 103.9 | 41.0 | 41.0 | 19.6 | 42.5 | 1293.8 | 85.3 |
| InstructBLIP | EVA-g | QFormer | Vicuna-13B | 224 | - | 82.8 | 121.9 | - | 49.5 | 33.4 | 50.7 | 1212.8 | 78.9 |
| InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 |
| InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 |
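The rows above share one recipe: a frozen LLM consumes visual tokens projected by a glue layer (QFormer or QLLaMA), so only the glue layer (and any small projection) is trained. Below is a structural sketch, with all module names as illustrative placeholders; the LLM is assumed to accept HF-style `inputs_embeds`.

```python
# Structural sketch of the frozen-LLM setting: trainable glue layer between
# a frozen vision encoder and a frozen LLM.
import torch
import torch.nn as nn

class FrozenLLMPipeline(nn.Module):
    def __init__(self, vision_encoder, glue_layer, llm):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.glue_layer = glue_layer          # trainable: vision tokens -> LLM space
        self.llm = llm
        for p in self.vision_encoder.parameters():
            p.requires_grad = False           # vision encoder frozen
        for p in self.llm.parameters():
            p.requires_grad = False           # LLM frozen

    def forward(self, images, text_embeds):
        vision_tokens = self.vision_encoder(images)        # (B, N, C) patch tokens
        visual_prefix = self.glue_layer(vision_tokens)     # (B, M, llm_dim) queries
        inputs = torch.cat([visual_prefix, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)              # LM consumes visual prefix
```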
- Multimodal Benchmarks with Trainable LLM
| Method | Vision Encoder | LLM | Res | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MMB | MMB-CN | MMVet |
| :--- | :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| LLaVA-1.5 | CLIP-L-336px | Vicuna-7B | 336 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 30.5 |
| InternVL-Chat | IViT-6B-224px | Vicuna-7B | 336 | 79.3 | 62.9 | 52.5 | 66.2 | 57.0 | 86.4 | 1525.1 | 64.6 | 57.6 | 31.2 |
| LLaVA-1.5 | CLIP-L-336px | Vicuna-13B | 336 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 35.4 |
| InternVL-Chat | IViT-6B-224px | Vicuna-13B | 336 | 80.2 | 63.9 | 54.6 | 70.1 | 58.7 | 87.1 | 1546.9 | 66.5 | 61.9 | 33.7 |
| InternVL-Chat | IViT-6B-448px | Vicuna-13B | 448 | 82.0 | 64.1 | 60.1 | 71.6 | 64.8 | 87.2 | 1579.0 | 68.2 | 64.0 | 36.7 |
## Citation
```bibtex
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
```