| Type | Model | Date | Download | Note |
| ---- | ----- | ---- | -------- | ---- |
| Vision-Language Foundation Model | InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| Vision-Language Foundation Model | InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP |
| Vision Large Language Model | InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| Vision Large Language Model | InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| Vision Large Language Model | InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
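The note for InternVL-14B-224px above mentions CLIP-style image-text retrieval. Below is a minimal sketch of that usage. It assumes the checkpoint's remote-code interface matches its 🤗 model card (a forward pass taking image/text tensors plus a `mode='InternVL-C'` switch, and a `summarize:` text prefix); the image path is hypothetical, and the current model card should be treated as authoritative.

```python
# Hedged sketch: assumes the OpenGVLab/InternVL-14B-224px remote code exposes
# a forward pass with mode='InternVL-C' (contrastive head), per its model card.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = 'OpenGVLab/InternVL-14B-224px'
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()
processor = CLIPImageProcessor.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0

image = Image.open('examples/red_panda.jpg').convert('RGB')  # hypothetical path
texts = ['summarize:a photo of a red panda',                 # 'summarize:' prefix per model card
         'summarize:a photo of a golden retriever']

pixel_values = processor(images=[image], return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

with torch.no_grad():
    logits_per_image, logits_per_text = model(
        image=pixel_values, text=input_ids, mode='InternVL-C')
print(logits_per_image.softmax(dim=-1))  # probability of each caption matching the image
```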
What is InternVL?
We released InternVL, which scales the ViT up to 6B parameters and aligns it with an LLM. It is the largest open-source vision/vision-language foundation model (14B) to date, achieving state-of-the-art performance on 32 benchmarks across a wide range of tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.
How is InternVL trained?
The training strategy of InternVL consists of three progressive stages: vision-language contrastive training, vision-language generative training, and supervised fine-tuning. These stages effectively leverage public data from diverse sources, ranging from noisy image-text pairs on the web to high-quality caption, VQA, and multimodal dialogue datasets.
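As a concrete anchor for the first stage, here is a minimal sketch of a symmetric image-text contrastive (InfoNCE, CLIP-style) objective. The shapes, dimensions, and temperature value are illustrative, not InternVL's exact training code.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     logit_scale: torch.Tensor) -> torch.Tensor:
    """image_emb, text_emb: (N, D) embeddings of N paired image-text examples."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = (logit_scale * image_emb) @ text_emb.t()   # (N, N) scaled similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal: contrast each image against all texts
    # in the batch, and each text against all images.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# usage with dummy embeddings
logit_scale = torch.tensor(1.0 / 0.07)  # inverse temperature; CLIP initializes tau = 0.07
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale)
```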
What can InternVL do?
InternVL is a "Swiss Army Knife" model. By flexibly combining the vision encoder and the language middleware, InternVL can support various vision or vision-language tasks, including:
Visual Perception
- Linear-Probe Image Classification
* ViT-22B uses the private JFT-3B dataset.
| Method | #Param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------ | ------ | ----- | ------- | ----- | ---- | ---- | --------- |
| ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| InternViT-6B | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
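For readers unfamiliar with the protocol behind these numbers: linear probing freezes the backbone and trains only a linear classifier on its features. A minimal sketch, with a stand-in backbone and illustrative dimensions (InternViT-6B's actual feature width may differ):

```python
# Minimal linear-probe sketch: frozen backbone, trainable linear head only.
import torch
import torch.nn as nn

feat_dim, num_classes = 3200, 1000  # illustrative values
backbone = nn.Identity()            # stand-in for the frozen vision encoder
head = nn.Linear(feat_dim, num_classes)

for p in backbone.parameters():
    p.requires_grad = False         # only the linear head is trained

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():           # no gradients through the backbone
        feats = backbone(images)
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with dummy data
print(probe_step(torch.randn(4, feat_dim), torch.randint(0, num_classes, (4,))))
```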
- Semantic Segmentation
| Method | Decoder | #Param (Train / Total) | Crop Size | mIoU |
| ------ | ------- | ---------------------- | --------- | ---- |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
- Zero-Shot Image Classification
| Method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ------ | ----- | ---- | ---- | ----- | --------- | --------- |
| ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| InternVL-C | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
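Zero-shot classification here follows the CLIP recipe: each class name is embedded once through a text prompt such as "a photo of a {class name}", and an image is assigned to the class whose prompt embedding is most similar. A minimal sketch with placeholder embeddings standing in for InternVL-C's image and text towers:

```python
# Minimal CLIP-style zero-shot classification sketch.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor,
                       class_text_emb: torch.Tensor) -> torch.Tensor:
    """image_emb: (N, D) image embeddings; class_text_emb: (C, D) embeddings
    of one prompt per class. Returns (N,) predicted class indices."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    sims = image_emb @ class_text_emb.t()   # (N, C) cosine similarities
    return sims.argmax(dim=-1)

# usage with dummy embeddings for 3 classes
preds = zero_shot_classify(torch.randn(5, 512), torch.randn(3, 512))
```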
- Multilingual Zero-Shot Image Classification
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian
| Method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| ------ | ---------- | ---------- | ---------- | ---------- | ---------- |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
- Zero-Shot Video Classification
| Method | #Frames | K400 | K600 | K700 |
| ------ | ------- | ---- | ---- | ---- |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C | 8 | 79.4 | 78.8 | 71.5 |
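The 1-frame and 8-frame rows differ only in temporal pooling. A common recipe for applying an image-text model to video, shown below as an illustrative sketch rather than InternVL's exact procedure, is to average the normalized frame embeddings into a single video embedding before matching it against class prompts:

```python
# Illustrative sketch: frame-averaged video embedding scored against class prompts.
import torch
import torch.nn.functional as F

def video_scores(frame_embs: torch.Tensor,
                 class_text_embs: torch.Tensor) -> torch.Tensor:
    """frame_embs: (T, D) embeddings of T sampled frames;
    class_text_embs: (C, D) class-prompt embeddings. Returns (C,) scores."""
    video_emb = F.normalize(F.normalize(frame_embs, dim=-1).mean(dim=0), dim=-1)
    return video_emb @ F.normalize(class_text_embs, dim=-1).t()

# usage: 8 sampled frames scored against 400 Kinetics-style class prompts
scores = video_scores(torch.randn(8, 512), torch.randn(400, 512))
```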
Cross-Modal Retrieval
- English Zero-Shot Image-Text Retrieval
| Model | Flickr30K I→T R@1 | R@5 | R@10 | Flickr30K T→I R@1 | R@5 | R@10 | COCO I→T R@1 | R@5 | R@10 | COCO T→I R@1 | R@5 | R@10 | Average |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ------- |
| OpenCLIP-G | 92.9 | 99.3 | 99.8 | 79.5 | 95.0 | 97.1 | 67.3 | 86.9 | 92.6 | 51.4 | 74.9 | 83.0 | 85.0 |
| EVA-02-CLIP-E+ | 93.9 | 99.4 | 99.8 | 78.8 | 94.2 | 96.8 | 68.8 | 87.8 | 92.8 | 51.1 | 75.0 | 82.7 | 85.1 |
| EVA-CLIP-8B | 95.6 | 99.6 | 99.9 | 80.8 | 95.5 | 97.6 | 70.3 | 89.3 | 93.9 | 53.0 | 76.0 | 83.4 | 86.2 |
| InternVL-C | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | 86.6 |
| InternVL-G | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | 88.8 |
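R@K in these tables is the fraction of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch, assuming one ground-truth match per query at the same index (real Flickr30K/COCO evaluation additionally handles five captions per image):

```python
# Minimal Recall@K sketch over a query-gallery similarity matrix.
import torch

def recall_at_k(sims: torch.Tensor, k: int) -> float:
    """sims: (Q, Q) similarity matrix with ground truth on the diagonal."""
    topk = sims.topk(k, dim=-1).indices                 # (Q, k) retrieved indices
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # (Q, 1) ground-truth index
    hits = (topk == targets).any(dim=-1).float()        # 1 if the match is in the top K
    return hits.mean().item() * 100                     # percentage, as in the tables

# usage with a random similarity matrix
sims = torch.randn(100, 100)
print(recall_at_k(sims, 1), recall_at_k(sims, 5), recall_at_k(sims, 10))
```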
- Chinese Zero-Shot Image-Text Retrieval
| Model | Flickr30K-CN I→T R@1 | R@5 | R@10 | Flickr30K-CN T→I R@1 | R@5 | R@10 | COCO-CN I→T R@1 | R@5 | R@10 | COCO-CN T→I R@1 | R@5 | R@10 | Average |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ------- |
| CN-CLIP-ViT-H | 81.6 | 97.5 | 98.8 | 71.2 | 91.4 | 95.5 | 63.0 | 86.6 | 92.9 | 69.2 | 89.9 | 96.1 | 86.1 |
| OpenCLIP-XLM-R-H | 86.1 | 97.5 | 99.2 | 71.0 | 90.5 | 94.9 | 70.0 | 91.5 | 97.0 | 66.1 | 90.8 | 96.0 | 87.6 |
| InternVL-C | 90.3 | 98.8 | 99.7 | 75.1 | 92.9 | 96.4 | 68.8 | 92.0 | 96.7 | 68.9 | 91.9 | 96.5 | 89.0 |
| InternVL-G | 92.9 | 99.4 | 99.8 | 77.7 | 94.8 | 97.3 | 71.4 | 93.9 | 97.7 | 73.8 | 94.4 | 98.1 | 90.9 |
- Multilingual Zero-Shot Image-Text Retrieval on XTD
| Method | EN | ES | FR | ZH | IT | KO | RU | JP | Average |
| ------ | -- | -- | -- | -- | -- | -- | -- | -- | ------- |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
Multimodal Dialogue
- Zero-Shot Image Captioning
| Method | COCO | Flickr30K | NoCaps |
| ------ | ---- | --------- | ------ |
| Emu-I | 117.7 | - | - |
| DreamLLM | 115.4 | - | - |
| InternVL-G | 128.2 | 79.2 | 113.7 |
- Multimodal Benchmarks with Frozen LLM
| Method | Vision Encoder | Glue Layer | LLM | Res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE |
| ------ | -------------- | ---------- | --- | --- | ---- | ------ | ------ | ----- | --- | ------ | ------- | --- | ---- |
| InstructBLIP | EVA-g | QFormer | Vicuna-7B | 224 | - | 82.4 | 123.1 | - | 49.2 | 34.5 | 50.1 | - | - |
| BLIP-2 | EVA-g | QFormer | Vicuna-13B | 224 | - | 71.6 | 103.9 | 41.0 | 41.0 | 19.6 | 42.5 | 1293.8 | 85.3 |
| InstructBLIP | EVA-g | QFormer | Vicuna-13B | 224 | - | 82.8 | 121.9 | - | 49.5 | 33.4 | 50.7 | 1212.8 | 78.9 |
| InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 |
| InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 |
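The "Glue Layer" column names the module that bridges the frozen vision encoder and the LLM. The sketch below is a schematic of that pattern (learnable queries cross-attending to frozen vision tokens, as in QFormer or QLLaMA), not InternVL's actual implementation; all dimensions are illustrative.

```python
# Schematic glue layer: learnable queries cross-attend to frozen vision tokens,
# producing a fixed number of tokens projected into the LLM's embedding space.
import torch
import torch.nn as nn

class GlueLayer(nn.Module):
    def __init__(self, num_queries=96, vis_dim=3200, llm_dim=4096, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        """vis_feats: (B, N, vis_dim) tokens from the frozen vision encoder."""
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, vis_feats, vis_feats)  # queries attend to image tokens
        return self.proj(out)  # (B, num_queries, llm_dim) tokens fed to the LLM

# usage: compress 257 vision tokens into 96 LLM-space tokens
tokens = GlueLayer()(torch.randn(2, 257, 3200))
```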
- Multimodal Benchmarks with Trainable LLM
| Method | Vision Encoder | LLM | Res | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MMB | MMB-CN | MMVet |
| ------ | -------------- | --- | --- | ----- | --- | ------ | --- | ------- | ---- | --- | --- | ------ | ----- |
| LLaVA-1.5 | CLIP-L-336px | Vicuna-7B | 336 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 30.5 |
| InternVL-Chat | IViT-6B-224px | Vicuna-7B | 336 | 79.3 | 62.9 | 52.5 | 66.2 | 57.0 | 86.4 | 1525.1 | 64.6 | 57.6 | 31.2 |
| LLaVA-1.5 | CLIP-L-336px | Vicuna-13B | 336 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 35.4 |
| InternVL-Chat | IViT-6B-224px | Vicuna-13B | 336 | 80.2 | 63.9 | 54.6 | 70.1 | 58.7 | 87.1 | 1546.9 | 66.5 | 61.9 | 33.7 |
| InternVL-Chat | IViT-6B-448px | Vicuna-13B | 448 | 82.0 | 64.1 | 60.1 | 71.6 | 64.8 | 87.2 | 1579.0 | 68.2 | 64.0 | 36.7 |
Citation
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}