ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu

TL;DR

This work argues that ImageNet accuracy alone is inadequate for comparing vision models. It analyzes ConvNeXt and ViT, each under supervised and CLIP training, across seven properties (mistakes, bias, calibration, robustness, transferability, synthetic-data performance, and invariance) using pretrained models without further fine-tuning. CLIP models transfer better and have lower error ratios on some mistake types, but are more overconfident; ConvNeXt is better calibrated, more robust on many ImageNet variants, more invariant, and stronger on synthetic data. The results argue for multi-faceted benchmarks to guide practical model selection, recommending supervised ConvNeXt for ImageNet-like tasks and CLIP when distribution shifts are expected.

Abstract

Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet.

Paper Structure

This paper contains 13 sections, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: Models are often compared only by their ImageNet accuracy, without looking at many other important behaviors. In our work, we analyze models with similar ImageNet accuracies and find that they have vastly different properties.
  • Figure 2: Model mistakes on ImageNet-X. Lower is better. ConvNeXt and ViT perform similarly within each training category. CLIP models achieve lower error ratios compared to supervised.
  • Figure 2: Transferability results on VTAB in subgroups. CLIP models are better on each of the dataset subgroups. For supervised models, ConvNeXt outperforms ViT by a large margin.
  • Figure 3: Fraction of shape vs texture decisions on cue-conflict dataset. ViT models show a higher shape bias. CLIP models are less texture-biased than their supervised counterparts. All models still have a significant fraction of texture decisions.
  • Figure 4: A cue-conflict image (Geirhos et al., 2018).
  • ...and 4 more figures
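
The shape-versus-texture fractions described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the common convention for cue-conflict images: only predictions that match either the shape class or the texture class are counted, and shape bias is the share of those that match the shape class. The function name `shape_bias` and the tuple layout are hypothetical, chosen for illustration; the paper's own evaluation code may differ in details such as how model outputs are mapped to the cue-conflict classes.

```python
from typing import Iterable, Tuple


def shape_bias(decisions: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of shape decisions among shape-or-texture decisions.

    Each item is (predicted_class, shape_class, texture_class) for one
    cue-conflict image. This tuple layout is an assumption for illustration.
    """
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in decisions:
        if shape_label == texture_label:
            continue  # no conflict between cues, so the image is uninformative
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # predictions matching neither cue are ignored
    total = shape_hits + texture_hits
    if total == 0:
        raise ValueError("no prediction matched a shape or texture label")
    return shape_hits / total


# Toy predictions (made up for illustration, not data from the paper).
example = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("dog", "cat", "elephant"),       # matches neither cue, ignored
    ("car", "car", "clock"),          # shape decision
]
bias = shape_bias(example)
print(f"shape bias: {bias:.3f}, texture fraction: {1 - bias:.3f}")
```

Under this convention, the texture fraction is one minus the shape bias, so a model with a higher shape bias is correspondingly less texture-biased.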