ViTamin: Designing Scalable Vision Models in the Vision-Language Era

Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

TL;DR

This work establishes an evaluation protocol for vision models in the vision-language era using DataComp-1B and CLIP, then introduces ViTamin, a three-stage hybrid backbone combining MBConv-LN and Transformer blocks to improve data and model scalability. ViTamin-L outperforms ViT-L/14 in zero-shot ImageNet and achieves competitive 38-task averages, while ViTamin-XL reaches $82.9\%$ zero-shot accuracy with far fewer parameters than prior large models. A Locked-Text Tuning strategy further boosts small variants by up to $+4$–$+5\%$, and up to $+23.3\%$ with data constraints, emphasizing the value of co-designing data and architecture. The approach yields state-of-the-art results on open-vocabulary detection and segmentation and strong performance in large multimodal setups, highlighting ViTamin as a practical backbone for scalable vision-language models.

Abstract

Recent breakthroughs in vision-language models (VLMs) start a new page in the vision community. VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models, thanks to training on large-scale Internet image-text pairs. However, despite the impressive achievements of VLMs, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure Transformer has proven effective for text encoding, it remains questionable whether the same holds for image encoding, especially considering that various types of networks have been proposed on the ImageNet benchmark but are rarely studied in VLMs. Because of the small data and model scale, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and their scalability in both model and training data sizes. To this end, we introduce ViTamin, a new family of vision models tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).

Paper Structure (23 sections, 1 equation, 4 figures, 20 tables)

Figures (4)

  • Figure 1: Practices of designing scalable vision models in the vision-language era. We benchmark modern vision models at various model and data scales under the CLIP setting using DataComp-1B [gadre2023datacomp], leading to findings about data and model scalability, feature resolution, and hybrid architecture, which motivate us to develop ViTamin for VLMs. ViTamin-L achieves superior zero-shot performance over ViT-L/14 [li2023clipa] on ImageNet [russakovsky2015imagenet] and on the average of 38 datasets [gadre2023datacomp], and advances a suite of 22 downstream tasks covering Open-Vocabulary (OV) detection [wu2023clipself] and segmentation [yu2023convolutions], and Large Multi-modal Model (LMM) tasks [liu2023improvedllava].
  • Figure 2: Benchmarking vision models under the CLIP setting on DataComp-1B, including ViT (a pure Transformer), ConvNeXt (a pure ConvNet), and CoAtNet (a hybrid model). We examine their scalability in terms of both data sizes (1st row) and model scales (2nd row), and further analyze the results from the aspects of feature resolution (3rd row) and hybrid architecture (4th row).
  • Figure 3: Overview of the ViTamin architecture. (a) ViTamin begins with a convolutional stem, followed by Mobile Convolution Blocks (MBConv) in stages 1 and 2, and Transformer Blocks (TFB) in stage 3. The 2D input to stage 3 is flattened to 1D. For the macro-level designs, the three-stage layout generates the final feature map with output stride 16, similar to ViT/16 [dosovitskiy2020image]. We set the channel sizes of the three stages to ($C$, $2C$, $6C$). For the micro-level designs, the employed MBConv-LN modifies MBConv [sandler2018mobilenetv2] by using a single LayerNorm [ba2016layer]. TFB-GeGLU upgrades the TFB's Feed-Forward Networks (FFNs) [vaswani2017attention] with GELU Gated Linear Units [shazeer2020glu]. (b) In the CLIP framework, given $N$ image-text pairs, the vision model's output $I_i$ is learned to align with its corresponding text Transformer's output $T_i$. Our text Transformers are the same as those in OpenCLIP [ilharco_gabriel_2021_5143773]. +: Addition. *: Multiplication.
  • Figure 4: Locked-text tuning (LTT). LTT exploits a pretrained frozen text encoder, and effectively boosts the model performance.
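
The alignment described in Figure 3(b), where each image output $I_i$ is pulled toward its paired text output $T_i$ among $N$ pairs, can be illustrated with a minimal sketch of a symmetric contrastive loss. This assumes the commonly used CLIP formulation (cosine similarities, a temperature, and cross-entropy in both directions); the captions above do not spell out the loss, and the function names and the fixed `temperature=0.07` are illustrative choices, not taken from the paper or from OpenCLIP's code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over the N x N image-text similarity matrix.

    image_feats[i] plays the role of I_i and text_feats[i] the role of T_i.
    The temperature value is an assumption made for this sketch.
    """
    n = len(image_feats)
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: row i should assign most probability to column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: column j should assign most probability to row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy features: correctly paired embeddings versus deliberately swapped ones.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy example the correctly paired features give a loss near zero, while the swapped pairing is heavily penalized, which is the behavior the alignment objective relies on. In the locked-text setting of Figure 4, only the image features would change during training because the text encoder is frozen; the loss computation itself would be unaffected under this sketch.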