| Type | Model | Date | Download | Note |
| --- | --- | --- | --- | --- |
| Vision-Language Foundation Model | InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| Vision-Language Foundation Model | InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP |
| Vision Large Language Model | InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| Vision Large Language Model | InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| Vision Large Language Model | InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
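The vision foundation model can also be used on its own as a frozen feature extractor (as in the linear-probe and frozen-backbone segmentation results further down). Below is a minimal sketch of loading InternViT-6B from the Hugging Face Hub with `transformers`; the repository id, dtype, and the output field name are assumptions to verify against the model card linked in the table.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Assumed Hugging Face repo id -- check the "HF link" in the table above for the exact name.
repo = "OpenGVLab/InternViT-6B-224px"

model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the checkpoint ships custom modeling code
).eval()
processor = CLIPImageProcessor.from_pretrained(repo)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch.bfloat16)

with torch.no_grad():
    outputs = model(pixel_values)

# The output field name is an assumption about the remote modeling code;
# the model card documents the exact structure.
features = outputs.pooler_output
print(features.shape)
```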

What is InternVL?

We released InternVL, which scales the vision transformer up to 6B parameters and aligns it with large language models. It is the largest open-source vision/vision-language foundation model (14B) to date, achieving state-of-the-art results on 32 benchmarks spanning tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.


How is InternVL trained?

The training strategy of InternVL consists of three progressive stages: vision-language contrastive training, vision-language generative training, and supervised fine-tuning. These stages effectively leverage public data from diverse sources, ranging from noisy image-text pairs crawled from the web to high-quality captioning, VQA, and multimodal dialogue datasets.

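Stage 1 aligns the vision encoder with the text side using a CLIP-style contrastive objective over image-text pairs. The snippet below is an illustrative sketch of that symmetric InfoNCE loss, not the project's training code; batch size, embedding dimension, and the temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] similarity matrix; matching pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```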

What can InternVL do?

InternVL is a "Swiss Army Knife" model. By flexibly combining the vision encoder and the language middleware, InternVL can support various vision and vision-language tasks, including:

Visual Perception

  • Linear-Probe Image Classification

    \* ViT-22B uses the private JFT-3B dataset.

    | Method | #Param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
    | OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
    | DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
    | EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
    | MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
    | InternViT-6B | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
  • Semantic Segmentation

    | Method | Decoder | #Param (Train / Total) | Crop Size | mIoU |
    | --- | --- | --- | --- | --- |
    | OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
    | ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
    | InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
    | ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
    | InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
    | ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
    | InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
  • Zero-Shot Image Classification (a minimal usage sketch follows after this list)

    | Method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
    | --- | --- | --- | --- | --- | --- | --- |
    | ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
    | OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
    | EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
    | InternVL-C | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
  • Multilingual Zero-Shot Image Classification

    EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

    | Method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
    | --- | --- | --- | --- | --- | --- |
    | Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
    | WuKong-ViT-L-G | - | 57.5 | - | - | - |
    | CN-CLIP-ViT-H | - | 59.6 | - | - | - |
    | AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
    | EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
    | OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
    | InternVL-C | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
  • Zero-Shot Video Classification

    | Method | #Frame | K400 | K600 | K700 |
    | --- | --- | --- | --- | --- |
    | OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
    | EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
    | InternVL-C | 1 | 71.0 | 71.3 | 65.7 |
    | ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
    | InternVL-C | 8 | 79.4 | 78.8 | 71.5 |
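Zero-shot classification with the contrastive model (InternVL-C) works the same way as in CLIP: embed one text prompt per class name and pick the class whose embedding is most similar to the image embedding. A minimal sketch, assuming you already have `encode_image` and `encode_text` callables that return embedding tensors (the real method names depend on the checkpoint's modeling code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(encode_image, encode_text, image, class_names,
                       template="a photo of a {}."):
    """Rank class-name prompts by cosine similarity to the image embedding."""
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)   # [num_classes, d]
    image_emb = F.normalize(encode_image(image), dim=-1)   # [1, d]
    probs = (image_emb @ text_emb.t()).softmax(dim=-1)     # [1, num_classes]
    top = probs.argmax(dim=-1).item()
    return class_names[top], probs[0, top].item()

# Toy usage with stand-in encoders that return random embeddings.
fake_image_encoder = lambda img: torch.randn(1, 512)
fake_text_encoder = lambda prompts: torch.randn(len(prompts), 512)
print(zero_shot_classify(fake_image_encoder, fake_text_encoder, None,
                         ["cat", "dog", "car"]))
```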

Cross-Modal Retrieval

  • English Zero-Shot Image-Text Retrieval (each cell reports R@1 / R@5 / R@10; a recall@K sketch follows after this list)

    | Model | Flickr30K image-to-text | Flickr30K text-to-image | COCO image-to-text | COCO text-to-image | Average |
    | --- | --- | --- | --- | --- | --- |
    | OpenCLIP-G | 92.9 / 99.3 / 99.8 | 79.5 / 95.0 / 97.1 | 67.3 / 86.9 / 92.6 | 51.4 / 74.9 / 83.0 | 85.0 |
    | EVA-02-CLIP-E+ | 93.9 / 99.4 / 99.8 | 78.8 / 94.2 / 96.8 | 68.8 / 87.8 / 92.8 | 51.1 / 75.0 / 82.7 | 85.1 |
    | EVA-CLIP-8B | 95.6 / 99.6 / 99.9 | 80.8 / 95.5 / 97.6 | 70.3 / 89.3 / 93.9 | 53.0 / 76.0 / 83.4 | 86.2 |
    | InternVL-C | 94.7 / 99.6 / 99.9 | 81.7 / 96.0 / 98.2 | 70.6 / 89.0 / 93.5 | 54.1 / 77.3 / 84.6 | 86.6 |
    | InternVL-G | 95.7 / 99.7 / 99.9 | 85.0 / 97.0 / 98.6 | 74.9 / 91.3 / 95.2 | 58.6 / 81.3 / 88.0 | 88.8 |
  • Chinese Zero-Shot Image-Text Retrieval (each cell reports R@1 / R@5 / R@10)

    | Model | Flickr30K-CN image-to-text | Flickr30K-CN text-to-image | COCO-CN image-to-text | COCO-CN text-to-image | Average |
    | --- | --- | --- | --- | --- | --- |
    | CN-CLIP-ViT-H | 81.6 / 97.5 / 98.8 | 71.2 / 91.4 / 95.5 | 63.0 / 86.6 / 92.9 | 69.2 / 89.9 / 96.1 | 86.1 |
    | OpenCLIP-XLM-R-H | 86.1 / 97.5 / 99.2 | 71.0 / 90.5 / 94.9 | 70.0 / 91.5 / 97.0 | 66.1 / 90.8 / 96.0 | 87.6 |
    | InternVL-C | 90.3 / 98.8 / 99.7 | 75.1 / 92.9 / 96.4 | 68.8 / 92.0 / 96.7 | 68.9 / 91.9 / 96.5 | 89.0 |
    | InternVL-G | 92.9 / 99.4 / 99.8 | 77.7 / 94.8 / 97.3 | 71.4 / 93.9 / 97.7 | 73.8 / 94.4 / 98.1 | 90.9 |
  • Multilingual Zero-Shot Image-Text Retrieval on XTD

    | Method | EN | ES | FR | ZH | IT | KO | RU | JP | Average |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
    | OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
    | InternVL-C | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
    | InternVL-G | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
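The R@1 / R@5 / R@10 numbers above are recall-at-K: the fraction of queries whose ground-truth match appears among the top-K retrieved items. Given a query-by-gallery similarity matrix (e.g. cosine similarities between text and image embeddings), the metric takes only a few lines; the sketch below assumes query i's correct match is gallery item i.

```python
import torch

def recall_at_k(similarity: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """similarity: [num_queries, num_gallery]; ground truth sits on the diagonal."""
    num_queries = similarity.size(0)
    # Gallery indices sorted by descending similarity for each query.
    ranking = similarity.argsort(dim=-1, descending=True)
    # 0-based rank position of the correct match for each query.
    correct = torch.arange(num_queries).unsqueeze(1)
    ranks = (ranking == correct).float().argmax(dim=-1)
    return {f"R@{k}": (ranks < k).float().mean().item() for k in ks}

# Toy usage: 100 text queries against 100 images with random similarities.
sims = torch.randn(100, 100)
print(recall_at_k(sims))
```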

Multimodal Dialogue

  • Zero-Shot Image Captioning

    | Method | COCO | Flickr30K | NoCaps |
    | --- | --- | --- | --- |
    | Emu-I | 117.7 | - | - |
    | DreamLLM | 115.4 | - | - |
    | InternVL-G | 128.2 | 79.2 | 113.7 |
  • Multimodal Benchmarks with Frozen LLM

    | Method | Vision Encoder | Glue Layer | LLM | Res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | InstructBLIP | EVA-g | QFormer | Vicuna-7B | 224 | - | 82.4 | 123.1 | - | 49.2 | 34.5 | 50.1 | - | - |
    | BLIP-2 | EVA-g | QFormer | Vicuna-13B | 224 | - | 71.6 | 103.9 | 41.0 | 41.0 | 19.6 | 42.5 | 1293.8 | 85.3 |
    | InstructBLIP | EVA-g | QFormer | Vicuna-13B | 224 | - | 82.8 | 121.9 | - | 49.5 | 33.4 | 50.7 | 1212.8 | 78.9 |
    | InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 |
    | InternVL-Chat | IViT-6B-224px | QLLaMA | Vicuna-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 |
  • Multimodal Benchmarks with Trainable LLM

    | Method | Vision Encoder | LLM | Res | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MMB | MMB-CN | MM-Vet |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | LLaVA-1.5 | CLIP-L-336px | Vicuna-7B | 336 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 30.5 |
    | InternVL-Chat | IViT-6B-224px | Vicuna-7B | 336 | 79.3 | 62.9 | 52.5 | 66.2 | 57.0 | 86.4 | 1525.1 | 64.6 | 57.6 | 31.2 |
    | LLaVA-1.5 | CLIP-L-336px | Vicuna-13B | 336 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 35.4 |
    | InternVL-Chat | IViT-6B-224px | Vicuna-13B | 336 | 80.2 | 63.9 | 54.6 | 70.1 | 58.7 | 87.1 | 1546.9 | 66.5 | 61.9 | 33.7 |
    | InternVL-Chat | IViT-6B-448px | Vicuna-13B | 448 | 82.0 | 64.1 | 60.1 | 71.6 | 64.8 | 87.2 | 1579.0 | 68.2 | 64.0 | 36.7 |

Citation


  @article{chen2023internvl,
    title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
    author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
    journal={arXiv preprint arXiv:2312.14238},
    year={2023}
  }
  @article{chen2024far,
    title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
    author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
    journal={arXiv preprint arXiv:2404.16821},
    year={2024}
  }

🔙 Go Back