| Type | Model | Date | Download | Note |
| ---- | ----- | ---- | -------- | ---- |
| Vision-Language Foundation Model | InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| Vision-Language Foundation Model | InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP |
| Vision Large Language Model | InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| Vision Large Language Model | InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| Vision Large Language Model | InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
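The note for InternVL-14B-224px above mentions CLIP-style image-text retrieval. Below is a minimal sketch of that usage. It assumes the checkpoint's remote-code interface matches its 🤗 model card (a forward pass taking image/text tensors plus a `mode='InternVL-C'` switch, and a `summarize:` text prefix); the image path is hypothetical, and the current model card should be treated as authoritative.

```python
# Hedged sketch: assumes the OpenGVLab/InternVL-14B-224px remote code exposes
# a forward pass with mode='InternVL-C' (contrastive head), per its model card.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = 'OpenGVLab/InternVL-14B-224px'
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()
processor = CLIPImageProcessor.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0

image = Image.open('examples/red_panda.jpg').convert('RGB')  # hypothetical path
texts = ['summarize:a photo of a red panda',                 # 'summarize:' prefix per model card
         'summarize:a photo of a golden retriever']

pixel_values = processor(images=[image], return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

with torch.no_grad():
    logits_per_image, logits_per_text = model(
        image=pixel_values, text=input_ids, mode='InternVL-C')
print(logits_per_image.softmax(dim=-1))  # probability of each caption matching the image
```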
What is InternVL?
We released InternVL, which scales the ViT up to 6B parameters and aligns it with an LLM. It is the largest open-source vision/vision-language foundation model (14B) to date, achieving state-of-the-art performance on 32 benchmarks across a wide range of tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.
How is InternVL trained?
The training strategy of InternVL consists of three progressive stages: vision-language contrastive training, vision-language generative training, and supervised fine-tuning. These stages effectively leverage public data from diverse sources, ranging from noisy image-text pairs on the web to high-quality caption, VQA, and multimodal dialogue datasets.
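As a concrete anchor for the first stage, here is a minimal sketch of a symmetric image-text contrastive (InfoNCE, CLIP-style) objective. The shapes, dimensions, and temperature value are illustrative, not InternVL's exact training code.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     logit_scale: torch.Tensor) -> torch.Tensor:
    """image_emb, text_emb: (N, D) embeddings of N paired image-text examples."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = (logit_scale * image_emb) @ text_emb.t()   # (N, N) scaled similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal: contrast each image against all texts
    # in the batch, and each text against all images.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# usage with dummy embeddings
logit_scale = torch.tensor(1.0 / 0.07)  # inverse temperature; CLIP initializes tau = 0.07
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale)
```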
What can InternVL do?
InternVL is a "Swiss Army Knife" model. By flexibly combining the vision encoder and the language middleware, InternVL can support various vision or vision-language tasks, including:
Visual Perception
- Linear-Probe Image Classification
* ViT-22B uses the private JFT-3B dataset.
| Method | #Param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------ | ------ | ----- | ------- | ----- | ---- | ---- | --------- |
| ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| InternViT-6B | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
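For readers unfamiliar with the protocol behind these numbers: linear probing freezes the backbone and trains only a linear classifier on its features. A minimal sketch, with a stand-in backbone and illustrative dimensions (InternViT-6B's actual feature width may differ):

```python
# Minimal linear-probe sketch: frozen backbone, trainable linear head only.
import torch
import torch.nn as nn

feat_dim, num_classes = 3200, 1000  # illustrative values
backbone = nn.Identity()            # stand-in for the frozen vision encoder
head = nn.Linear(feat_dim, num_classes)

for p in backbone.parameters():
    p.requires_grad = False         # only the linear head is trained

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():           # no gradients through the backbone
        feats = backbone(images)
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with dummy data
print(probe_step(torch.randn(4, feat_dim), torch.randint(0, num_classes, (4,))))
```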
- Semantic Segmentation
| Method | Decoder | #Param (Train / Total) | Crop Size | mIoU |
| ------ | ------- | ---------------------- | --------- | ---- |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
- Zero-Shot Image Classification
| Method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ------ | ----- | ---- | ---- | ----- | --------- | --------- |
| ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| InternVL-C | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
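Zero-shot classification here follows the CLIP recipe: each class name is embedded once through a text prompt such as "a photo of a {class name}", and an image is assigned to the class whose prompt embedding is most similar. A minimal sketch with placeholder embeddings standing in for InternVL-C's image and text towers:

```python
# Minimal CLIP-style zero-shot classification sketch.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor,
                       class_text_emb: torch.Tensor) -> torch.Tensor:
    """image_emb: (N, D) image embeddings; class_text_emb: (C, D) embeddings
    of one prompt per class. Returns (N,) predicted class indices."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    sims = image_emb @ class_text_emb.t()   # (N, C) cosine similarities
    return sims.argmax(dim=-1)

# usage with dummy embeddings for 3 classes
preds = zero_shot_classify(torch.randn(5, 512), torch.randn(3, 512))
```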
- Multilingual Zero-Shot Image Classification
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian
| Method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| ------ | ---------- | ---------- | ---------- | ---------- | ---------- |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
- Zero-Shot Video Classification
| Method | #Frames | K400 | K600 | K700 |
| ------ | ------- | ---- | ---- | ---- |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C | 8 | 79.4 | 78.8 | 71.5 |
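The 1-frame and 8-frame rows differ only in temporal pooling. A common recipe for applying an image-text model to video, shown below as an illustrative sketch rather than InternVL's exact procedure, is to average the normalized frame embeddings into a single video embedding before matching it against class prompts:

```python
# Illustrative sketch: frame-averaged video embedding scored against class prompts.
import torch
import torch.nn.functional as F

def video_scores(frame_embs: torch.Tensor,
                 class_text_embs: torch.Tensor) -> torch.Tensor:
    """frame_embs: (T, D) embeddings of T sampled frames;
    class_text_embs: (C, D) class-prompt embeddings. Returns (C,) scores."""
    video_emb = F.normalize(F.normalize(frame_embs, dim=-1).mean(dim=0), dim=-1)
    return video_emb @ F.normalize(class_text_embs, dim=-1).t()

# usage: 8 sampled frames scored against 400 Kinetics-style class prompts
scores = video_scores(torch.randn(8, 512), torch.randn(400, 512))
```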
Cross-Modal Retrieval
- English Zero-Shot Image-Text Retrieval
| Model | Flickr30K I→T R@1 | R@5 | R@10 | Flickr30K T→I R@1 | R@5 | R@10 | COCO I→T R@1 | R@5 | R@10 | COCO T→I R@1 | R@5 | R@10 | Average |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ------- |
| OpenCLIP-G | 92.9 | 99.3 | 99.8 | 79.5 | 95.0 | 97.1 | 67.3 | 86.9 | 92.6 | 51.4 | 74.9 | 83.0 | 85.0 |
| EVA-02-CLIP-E+ | 93.9 | 99.4 | 99.8 | 78.8 | 94.2 | 96.8 | 68.8 | 87.8 | 92.8 | 51.1 | 75.0 | 82.7 | 85.1 |
| EVA-CLIP-8B | 95.6 | 99.6 | 99.9 | 80.8 | 95.5 | 97.6 | 70.3 | 89.3 | 93.9 | 53.0 | 76.0 | 83.4 | 86.2 |
| InternVL-C | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | 86.6 |
| InternVL-G | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | 88.8 |
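R@K in these tables is the fraction of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch, assuming one ground-truth match per query at the same index (real Flickr30K/COCO evaluation additionally handles five captions per image):

```python
# Minimal Recall@K sketch over a query-gallery similarity matrix.
import torch

def recall_at_k(sims: torch.Tensor, k: int) -> float:
    """sims: (Q, Q) similarity matrix with ground truth on the diagonal."""
    topk = sims.topk(k, dim=-1).indices                 # (Q, k) retrieved indices
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # (Q, 1) ground-truth index
    hits = (topk == targets).any(dim=-1).float()        # 1 if the match is in the top K
    return hits.mean().item() * 100                     # percentage, as in the tables

# usage with a random similarity matrix
sims = torch.randn(100, 100)
print(recall_at_k(sims, 1), recall_at_k(sims, 5), recall_at_k(sims, 10))
```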
- Chinese Zero-Shot Image-Text Retrieval
| Model | Flickr30K-CN I→T R@1 | R@5 | R@10 | Flickr30K-CN T→I R@1 | R@5 | R@10 | COCO-CN I→T R@1 | R@5 | R@10 | COCO-CN T→I R@1 | R@5 | R@10 | Average |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ------- |
| CN-CLIP-ViT-H | 81.6 | 97.5 | 98.8 | 71.2 | 91.4 | 95.5 | 63.0 | 86.6 | 92.9 | 69.2 | 89.9 | 96.1 | 86.1 |
| OpenCLIP-XLM-R-H | 86.1 | 97.5 | 99.2 | 71.0 | 90.5 | 94.9 | 70.0 | 91.5 | 97.0 | 66.1 | 90.8 | 96.0 | 87.6 |
| InternVL-C | 90.3 | 98.8 | 99.7 | 75.1 | 92.9 | 96.4 | 68.8 | 92.0 | 96.7 | 68.9 | 91.9 | 96.5 | 89.0 |
| InternVL-G | 92.9 | 99.4 | 99.8 | 77.7 | 94.8 | 97.3 | 71.4 | 93.9 | 97.7 | 73.8 | 94.4 | 98.1 | 90.9 |
- Multilingual Zero-Shot Image-Text Retrieval on XTD
| Method | EN | ES | FR | ZH | IT | KO | RU | JP | Average |
| ------ | -- | -- | -- | -- | -- | -- | -- | -- | ------- |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
Multimodal Dialogue
- Zero-Shot Image Captioning
| Method | COCO | Flickr30K | NoCaps |
| ------ | ---- | --------- | ------ |
| Emu-I | 117.7 | - | - |
| DreamLLM | 115.4 | - | - |
| InternVL-G | 128.2 | 79.2 | 113.7 |
- Multimodal Benchmarks with Frozen LLM
| Method | Vision Encoder | Glue Layer | LLM | Res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE |
| ------ | -------------- | ---------- | --- | --- | ---- | ------ | ------ | ----- | --- | ------ | ------- | --- | ---- |
| InstructBLIP | EVA-g | QFormer | Vicuna-7B | 224 | - | 82.4 | 123.1 | - | 49.2 | 34.5 | 50.1 | - | - |
| BLIP-2 | EVA-g | QFormer | Vicuna-13B | 224 | - | 71.6 | 103.9 | 41.0 | 41.0 | 19.6 | 42.5 | 1293.8 | 85.3 |
| InstructBLIP | EVA-g | QFormer | Vicuna-13B | 224 | - | 82.8 | 121.9 | - | 49.5 | 33.4 | 50.7 | 1212.8 | 78.9 |
| InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 |
| InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 |
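The "Glue Layer" column names the module that bridges the frozen vision encoder and the LLM. The sketch below is a schematic of that pattern (learnable queries cross-attending to frozen vision tokens, as in QFormer or QLLaMA), not InternVL's actual implementation; all dimensions are illustrative.

```python
# Schematic glue layer: learnable queries cross-attend to frozen vision tokens,
# producing a fixed number of tokens projected into the LLM's embedding space.
import torch
import torch.nn as nn

class GlueLayer(nn.Module):
    def __init__(self, num_queries=96, vis_dim=3200, llm_dim=4096, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        """vis_feats: (B, N, vis_dim) tokens from the frozen vision encoder."""
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, vis_feats, vis_feats)  # queries attend to image tokens
        return self.proj(out)  # (B, num_queries, llm_dim) tokens fed to the LLM

# usage: compress 257 vision tokens into 96 LLM-space tokens
tokens = GlueLayer()(torch.randn(2, 257, 3200))
```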
- Multimodal Benchmarks with Trainable LLM
| Method | Vision Encoder | LLM | Res | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MMB | MMB-CN | MMVet |
| ------ | -------------- | --- | --- | ----- | --- | ------ | --- | ------- | ---- | --- | --- | ------ | ----- |
| LLaVA-1.5 | CLIP-L-336px | Vicuna-7B | 336 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 30.5 |
| InternVL-Chat | IViT-6B-224px | Vicuna-7B | 336 | 79.3 | 62.9 | 52.5 | 66.2 | 57.0 | 86.4 | 1525.1 | 64.6 | 57.6 | 31.2 |
| LLaVA-1.5 | CLIP-L-336px | Vicuna-13B | 336 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 35.4 |
| InternVL-Chat | IViT-6B-224px | Vicuna-13B | 336 | 80.2 | 63.9 | 54.6 | 70.1 | 58.7 | 87.1 | 1546.9 | 66.5 | 61.9 | 33.7 |
| InternVL-Chat | IViT-6B-448px | Vicuna-13B | 448 | 82.0 | 64.1 | 60.1 | 71.6 | 64.8 | 87.2 | 1579.0 | 68.2 | 64.0 | 36.7 |
Citation
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}