| Type | Model | Date | Download | Note |
| ---- | ----- | ---- | -------- | ---- |
| Vision-Language Foundation Model | InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| Vision-Language Foundation Model | InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP |
| Vision Large Language Model | InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| Vision Large Language Model | InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| Vision Large Language Model | InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |

What is InternVL?

We released InternVL, which scales the vision transformer up to 6B parameters and aligns it with large language models. It is the largest open-source vision/vision-language foundation model (14B) to date, achieving state-of-the-art results on 32 benchmarks covering tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.


How is InternVL trained?

The training strategy of InternVL consists of three progressive stages: vision-language contrastive training, vision-language generative training, and supervised fine-tuning. These stages effectively leverage public data from diverse sources, ranging from noisy image-text pairs on the web to high-quality caption, VQA, and multimodal dialogue datasets.

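The first of these stages is CLIP-style contrastive alignment between the vision encoder and the text side of the language middleware. The snippet below is a minimal sketch of the symmetric InfoNCE objective used in that kind of contrastive training; it is illustrative only (the encoders, batch shapes, and `temperature` value are assumptions), not the project's actual training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_feats, text_feats: (batch, dim) tensors produced by the vision
    encoder and the text encoder (shapes are an assumption of this sketch).
    """
    # Cosine similarities: L2-normalize both sides, then take dot products.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (batch, batch)

    # Matching image/text pairs sit on the diagonal of the logit matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

The generative and supervised fine-tuning stages then build on the aligned representations with language-modeling objectives over caption, VQA, and dialogue data, as described above.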

What can InternVL do?

InternVL is a “Swiss Army Knife” model. By flexibly combining the vision encoder and the language middleware, InternVL can support various vision or vision-language tasks, including:

Visual Perception

  • Linear-Probe Image Classification

    | Method | #Param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
    | ------ | ------ | ----- | ------- | ----- | ---- | ---- | --------- |
    | ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
    | OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
    | DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
    | EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
    | MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
    | InternViT-6B | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |

    * ViT-22B uses the private JFT-3B dataset.
  • Semantic Segmentation

    | Method | Decoder | #Param (Train / Total) | Crop Size | mIoU |
    | ------ | ------- | ---------------------- | --------- | ---- |
    | OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
    | ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
    | InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
    | ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
    | InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
    | ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
    | InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
  • Zero-Shot Image Classification (a scoring sketch follows this list)

    | Method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
    | ------ | ----- | ---- | ---- | ----- | --------- | --------- |
    | ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
    | OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
    | EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
    | InternVL-C | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
  • Multilingual Zero-Shot Image Classification
    EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

    | Method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
    | ------ | ---------- | ---------- | ---------- | ---------- | ---------- |
    | Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
    | WuKong-ViT-L-G | - | 57.5 | - | - | - |
    | CN-CLIP-ViT-H | - | 59.6 | - | - | - |
    | AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
    | EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
    | OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
    | InternVL-C | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
  • Zero-Shot Video Classification

    | Method | #Frames | K400 | K600 | K700 |
    | ------ | ------- | ---- | ---- | ---- |
    | OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
    | EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
    | InternVL-C | 1 | 71.0 | 71.3 | 65.7 |
    | ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
    | InternVL-C | 8 | 79.4 | 78.8 | 71.5 |
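
The zero-shot classification numbers above are obtained CLIP-style: each class name is wrapped in a text prompt, the image and all prompts are embedded, and the class with the highest cosine similarity is predicted. The sketch below illustrates that scoring loop in generic PyTorch; `encode_image` and `encode_text` are placeholder method names for a dual-encoder checkpoint such as InternVL-C, not its confirmed interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, image_tensor, class_names):
    """Generic CLIP-style zero-shot classification.

    `model.encode_image` / `model.encode_text` are placeholder names for a
    dual-encoder interface; consult the model card for the real entry points.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    text_inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    image_feat = F.normalize(model.encode_image(image_tensor), dim=-1)  # (1, dim)
    text_feat = F.normalize(model.encode_text(**text_inputs), dim=-1)   # (C, dim)

    # Cosine similarity between the image and every class prompt; argmax wins.
    scores = image_feat @ text_feat.t()  # (1, C)
    return class_names[scores.argmax(dim=-1).item()]
```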

Cross-Modal Retrieval

  • English Zero-Shot Image-Text Retrieval

    | Model | Flickr30K image→text (R@1 / R@5 / R@10) | Flickr30K text→image (R@1 / R@5 / R@10) | COCO image→text (R@1 / R@5 / R@10) | COCO text→image (R@1 / R@5 / R@10) | Average |
    | ----- | --------------------------------------- | --------------------------------------- | ---------------------------------- | ---------------------------------- | ------- |
    | OpenCLIP-G | 92.9 / 99.3 / 99.8 | 79.5 / 95.0 / 97.1 | 67.3 / 86.9 / 92.6 | 51.4 / 74.9 / 83.0 | 85.0 |
    | EVA-02-CLIP-E+ | 93.9 / 99.4 / 99.8 | 78.8 / 94.2 / 96.8 | 68.8 / 87.8 / 92.8 | 51.1 / 75.0 / 82.7 | 85.1 |
    | EVA-CLIP-8B | 95.6 / 99.6 / 99.9 | 80.8 / 95.5 / 97.6 | 70.3 / 89.3 / 93.9 | 53.0 / 76.0 / 83.4 | 86.2 |
    | InternVL-C | 94.7 / 99.6 / 99.9 | 81.7 / 96.0 / 98.2 | 70.6 / 89.0 / 93.5 | 54.1 / 77.3 / 84.6 | 86.6 |
    | InternVL-G | 95.7 / 99.7 / 99.9 | 85.0 / 97.0 / 98.6 | 74.9 / 91.3 / 95.2 | 58.6 / 81.3 / 88.0 | 88.8 |
  • Chinese Zero-Shot Image-Text Retrieval

    | Model | Flickr30K-CN image→text (R@1 / R@5 / R@10) | Flickr30K-CN text→image (R@1 / R@5 / R@10) | COCO-CN image→text (R@1 / R@5 / R@10) | COCO-CN text→image (R@1 / R@5 / R@10) | Average |
    | ----- | ------------------------------------------ | ------------------------------------------ | ------------------------------------- | ------------------------------------- | ------- |
    | CN-CLIP-ViT-H | 81.6 / 97.5 / 98.8 | 71.2 / 91.4 / 95.5 | 63.0 / 86.6 / 92.9 | 69.2 / 89.9 / 96.1 | 86.1 |
    | OpenCLIP-XLM-R-H | 86.1 / 97.5 / 99.2 | 71.0 / 90.5 / 94.9 | 70.0 / 91.5 / 97.0 | 66.1 / 90.8 / 96.0 | 87.6 |
    | InternVL-C | 90.3 / 98.8 / 99.7 | 75.1 / 92.9 / 96.4 | 68.8 / 92.0 / 96.7 | 68.9 / 91.9 / 96.5 | 89.0 |
    | InternVL-G | 92.9 / 99.4 / 99.8 | 77.7 / 94.8 / 97.3 | 71.4 / 93.9 / 97.7 | 73.8 / 94.4 / 98.1 | 90.9 |
  • Multilingual Zero-Shot Image-Text Retrieval on XTD

    | Method | EN | ES | FR | ZH | IT | KO | RU | JP | Average |
    | ------ | -- | -- | -- | -- | -- | -- | -- | -- | ------- |
    | AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
    | OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
    | InternVL-C | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
    | InternVL-G | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
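
The R@K columns above report standard retrieval recall: a query is counted as correct if its ground-truth match appears among the top-K candidates ranked by embedding similarity. The sketch below shows how image-to-text Recall@K is typically computed from precomputed embeddings; it is a generic illustration under assumed tensor shapes, not the project's evaluation code.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_feats: torch.Tensor,
                text_feats: torch.Tensor,
                k: int = 1) -> float:
    """Image-to-text Recall@K for paired embeddings.

    image_feats, text_feats: (N, dim) tensors where row i of each tensor
    belongs to the same image-text pair (an assumption of this sketch).
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sims = image_feats @ text_feats.t()  # (N, N) similarity matrix

    # For each image, rank all texts and check whether its paired caption
    # (index i) appears among the top-k candidates.
    topk = sims.topk(k, dim=-1).indices                                     # (N, k)
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)  # (N, 1)
    return (topk == targets).any(dim=-1).float().mean().item()
```

Text-to-image recall is the same computation with the two embedding tensors swapped.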

Multimodal Dialogue

  • Zero-Shot Image Captioning

    | Method | COCO | Flickr30K | NoCaps |
    | ------ | ---- | --------- | ------ |
    | Emu-I | 117.7 | - | - |
    | DreamLLM | 115.4 | - | - |
    | InternVL-G | 128.2 | 79.2 | 113.7 |
  • Multimodal Benchmarks with Frozen LLM

    | Method | Vision Encoder | Glue Layer | LLM | Res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE |
    | ------ | -------------- | ---------- | --- | --- | ---- | ------ | ------ | ----- | --- | ------ | ------- | --- | ---- |
    | InstructBLIP | EVA-g | QFormer | Vicuna-7B | 224 | - | 82.4 | 123.1 | - | 49.2 | 34.5 | 50.1 | - | - |
    | BLIP-2 | EVA-g | QFormer | Vicuna-13B | 224 | - | 71.6 | 103.9 | 41.0 | 41.0 | 19.6 | 42.5 | 1293.8 | 85.3 |
    | InstructBLIP | EVA-g | QFormer | Vicuna-13B | 224 | - | 82.8 | 121.9 | - | 49.5 | 33.4 | 50.7 | 1212.8 | 78.9 |
    | InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 |
    | InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 |
  • Multimodal Benchmarks with Trainable LLM

    | Method | Vision Encoder | LLM | Res | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MMB | MMB-CN | MM-Vet |
    | ------ | -------------- | --- | --- | ----- | --- | ------ | --- | ------- | ---- | --- | --- | ------ | ------ |
    | LLaVA-1.5 | CLIP-L-336px | Vicuna-7B | 336 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 30.5 |
    | InternVL-Chat | IViT-6B-224px | Vicuna-7B | 336 | 79.3 | 62.9 | 52.5 | 66.2 | 57.0 | 86.4 | 1525.1 | 64.6 | 57.6 | 31.2 |
    | LLaVA-1.5 | CLIP-L-336px | Vicuna-13B | 336 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 35.4 |
    | InternVL-Chat | IViT-6B-224px | Vicuna-13B | 336 | 80.2 | 63.9 | 54.6 | 70.1 | 58.7 | 87.1 | 1546.9 | 66.5 | 61.9 | 33.7 |
    | InternVL-Chat | IViT-6B-448px | Vicuna-13B | 448 | 82.0 | 64.1 | 60.1 | 71.6 | 64.8 | 87.2 | 1579.0 | 68.2 | 64.0 | 36.7 |
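
For reference, running one of the InternVL-Chat checkpoints for multimodal dialogue looks roughly like the sketch below. The repository id, the image preprocessing, and the `model.chat(...)` entry point are assumptions about the custom code shipped with the checkpoints (loaded via `trust_remote_code=True`); the HF model cards linked above give the authoritative usage.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

# Hypothetical repository id; substitute the actual InternVL-Chat checkpoint.
path = "OpenGVLab/InternVL-Chat-13B"

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# `chat` is assumed to be the conversational entry point exposed by the
# checkpoint's custom modelling code; treat the exact signature as an assumption.
response = model.chat(
    tokenizer,
    pixel_values,
    "Describe the image in detail.",
    generation_config=dict(max_new_tokens=256, do_sample=False),
)
print(response)
```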

Citation


  @inproceedings{chen2024internvl,
    title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={24185--24198},
    year={2024}
  }

