Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?

Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth

TL;DR

This work addresses the lack of guidance on designing well-behaved image classifiers by evaluating nine quality dimensions across 326 backbones on ImageNet-1k and introducing the QUBA score for multi-dimensional ranking. It shows that larger training datasets and self-supervised pretraining followed by end-to-end fine-tuning improve most quality dimensions, that vision-language models achieve strong class balance and domain robustness, and that transformers generally outperform CNNs across multiple dimensions. The study also examines the relationships among quality dimensions and offers dimension-aware recommendations through QUBA-based rankings, so that practitioners can select models for their specific needs. Overall, the paper advocates evaluating broad quality profiles rather than accuracy alone, to advance the development of robust, calibrated, fair, and efficient vision systems.

Abstract

Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal various new insights, including that (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.

Paper Structure

This paper contains 26 sections, 13 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Visualization of two of our main results. We compare nine different quality dimensions for popular backbone models trained with standard supervised learning (SL) against the corresponding backbones trained after initialization with weights obtained through self-supervised learning (left) and when utilized in a vision-language (ViL) model (right). Axis units indicate the distance (in standard deviations) to the mean (0 line) of each quality dimension; see \ref{eq:quba} and its explanation for details. Please refer to \ref{tab:comparisons} (e) and (j) for raw values and \ref{sec:experiments_whatmakesbetter} for an interpretation of the results.
  • Figure 2: Rank correlation matrix for the considered metrics on computational cost for our full model zoo of 326 models. All entries have a $p$-value below 0.05, indicating statistical significance.
  • Figure 3: Different quality dimensions (y axis) vs. accuracy (x axis). To reduce clutter in the plots, we only plot representative models instead of our full model zoo; please refer to the project page for interactive plots with all models. To emphasize the effect of different training strategies and model architectures, we group models visually: the training dataset size is marked by symbols within each marker (no symbol for ImageNet-1k, dot ($\cdot$) for ImageNet-21k, star ($\star$) for large-scale datasets); different training strategies by shapes (standard supervised training as squares, adversarial training as circles, self-supervised (pre-)training as triangles, semi-supervised training as diamonds, A[1,2,3] training as pentagons); and different architectures by color (blue for CNNs, orange for Transformers, green for B-cos models, and yellow for vision-language (ViL) models).
  • Figure 4: Rank correlation matrix for the considered quality dimensions among our full model zoo. All non-crossed-out entries have a $p$-value below 0.05, indicating statistical significance. Crossed-out entries correspond to $p$-values above 0.05 and are therefore not statistically significant.
  • Figure 5: Top five QUBA score models under different weightings. We report the top five models when weighting specific (groups of) quality dimensions twice as strongly. See \ref{fig:scatter_dimensions_acc} for color and marker details.
  • ...and 9 more figures
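
The captions of Figures 1 and 5 describe two ingredients of a QUBA-style ranking only in words: each quality dimension is expressed as a distance to the mean in standard deviations, and selected dimensions can be weighted twice as strongly. The following is a minimal Python sketch of that idea, not the paper's definition: it assumes a plain weighted mean of z-scores, whereas the exact formula is the equation referenced as eq:quba, which is not reproduced in this listing. The model names, the scores, and all function names are hypothetical, and every dimension is assumed to be oriented so that higher is better.

```python
from statistics import mean, pstdev

# Hypothetical scores: one row per model, one column per quality dimension.
# All numbers are made up for illustration; higher is assumed to be better.
DIMENSIONS = ["accuracy", "calibration", "robustness"]
SCORES = {
    "model_a": [0.80, 0.60, 0.50],
    "model_b": [0.76, 0.90, 0.70],
    "model_c": [0.84, 0.30, 0.60],
}


def standardize(scores):
    """Convert each dimension to z-scores (distance to the mean in std devs)."""
    names = list(scores)
    columns = list(zip(*(scores[n] for n in names)))
    z_columns = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        # A constant dimension carries no ranking information; map it to 0.
        z_columns.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])
    return {n: [zc[i] for zc in z_columns] for i, n in enumerate(names)}


def weighted_score(z_scores, weights):
    """Weighted mean of the standardized dimensions for each model (assumed form)."""
    total = sum(weights)
    return {
        name: sum(w * z for w, z in zip(weights, zs)) / total
        for name, zs in z_scores.items()
    }


def rank_models(scores, weights):
    """Return model names sorted from best to worst aggregate score."""
    agg = weighted_score(standardize(scores), weights)
    return sorted(agg, key=agg.get, reverse=True)


if __name__ == "__main__":
    print(rank_models(SCORES, [1, 1, 1]))  # equal weighting
    print(rank_models(SCORES, [2, 1, 1]))  # accuracy counted twice as strongly
```

With these toy numbers, doubling the weight on accuracy moves model_c ahead of model_b, which illustrates why a ranking of this kind depends on the weighting a user chooses.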