Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Siyuan Li, Juanxi Tian, Zedong Wang, Luyuan Zhang, Zicheng Liu, Weiyang Jin, Yang Liu, Baigui Sun, Stan Z. Li

TL;DR

Canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with the SGD family, while recent architectures like ViTs and ConvNeXt are tightly coupled with adaptive learning rate optimizers. The paper also summarizes takeaways on recommended optimizers and insights into robust vision backbone architectures.

Abstract

This paper delves into the interplay between vision backbones and optimizers, unveiling an inter-dependent phenomenon termed backbone-optimizer coupling bias (BOCB). We observe that canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with the SGD family, while recent architectures like ViTs and ConvNeXt are tightly coupled with adaptive learning rate optimizers. We further show that BOCB can be introduced by both optimizers and certain backbone designs and may significantly impact the pre-training and downstream fine-tuning of vision models. Through in-depth empirical analysis, we summarize takeaways on recommended optimizers and insights into robust vision backbone architectures. We hope this work can inspire the community to question long-held assumptions on backbones and optimizers, stimulate further explorations, and thereby contribute to more robust vision systems. The source code and models are publicly available at https://bocb-ai.github.io/.

Paper Structure (41 sections, 1 equation, 15 figures, 7 tables, 1 algorithm)

This paper contains 41 sections, 1 equation, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: Vision backbones with representative macro and micro designs since 2012. (a) Primary CNNs like VGG laid the foundation for vision backbone design, i.e., multi-layer networks built by plainly stacking building blocks. (b) Classical CNNs like ResNet identified the overall framework of vision backbones as hierarchical stages, each comprising stacked bottlenecks connected by overlapped downsampling layers. (c) Modern DNNs introduced different intra-block structures while presenting two main groups of stage-wise design: hierarchical and isotropic stages with downsampling and patchifying. We summarize all the technical details of these typical vision backbones in Table \ref{tab:app_backbone}.
  • Figure 2: Overview of mainstream gradient-based optimizers, which are categorized by the techniques of gradient estimation (step 2) and learning rate adjustment (step 3) in Algorithm \ref{alg:optimizer_update}. (a) and (d) optionally employ only step 2 (momentum gradients) or step 3 (adaptive learning rates), while (b) and (c) employ both. (b) obtains adaptive learning rates by estimating second moments; (c) estimates the dynamic learning rate from gradient components other than the second moments.
  • Figure 3: Violin plot of the performance stability of different backbones. We visualize the results in Table \ref{tab:cifar100_backbone} as violin plots to show the performance stability of different vision backbones. In particular, favorable backbones should not only achieve strong performance (high mean accuracy) with a few optimizers but also yield a small performance variance (a flat distribution without outliers). Grey dots denote outliers (backbone-optimizer combinations with poor results), which reveal the phenomenon of BOCB. We suggest that well-designed vision backbones should exhibit both superior performance and high performance stability across optimizers to mitigate the risk of BOCB.
  • Figure 4: Boxplot visualization of hyper-parameter robustness (learning rate and weight decay) for various backbones on CIFAR-100. The vertical axis denotes the variation (measured by Manhattan distance) of the optimal hyper-parameters for a given backbone across different optimizers from the default (mode) values. Overall, backbones with a larger mean and variance of variation (e.g., AlexNet, EfficientNet-B0, ConvNeXt-T, and ConvFormer-S12) require more tuning effort in practice and may be hard to adapt to new or poorly studied optimizers and tasks. In contrast, models with a low maximum and a small median variation (e.g., ResNet-50, RepVGG-A1, and CAFormer-S12) are regarded as more robust and as having more favorable optimization behavior from the optimizers' point of view.
  • Figure 5: Boxplot of optimizer generality across different backbones on CIFAR-100. Symmetrically to Figure \ref{fig:box_backbone_hyper}, the analysis scope here is switched from backbones to optimizers to show each optimizer's generality in terms of hyper-parameter robustness. Some optimizers in Category (b) show favorable robustness (e.g., AdamW and LAMB). In contrast, several optimizers in the other three categories show poor generality (e.g., SGDP, AdaBound, and LARS) and are excluded from our further discussion of the connection between BOCB and diverse vision backbone designs.
  • ...and 10 more figures
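
The hyper-parameter variation plotted in Figure 4 can be illustrated with a minimal Python sketch. It assumes the variation is the Manhattan distance between each optimizer's best (learning rate, weight decay) pair and the per-backbone mode of those values, and that distances are measured in log10 space. The log scaling, the function names, and the example values are all illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def mode(values):
    """Return the most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


def hyperparameter_variation(best_settings):
    """Manhattan distance of each optimizer's best (lr, wd) to the mode setting.

    best_settings maps optimizer name -> (learning_rate, weight_decay).
    Distances are computed in log10 space (an assumption of this sketch),
    so both values must be strictly positive.
    """
    lr_mode = mode([lr for lr, _ in best_settings.values()])
    wd_mode = mode([wd for _, wd in best_settings.values()])
    return {
        name: abs(math.log10(lr) - math.log10(lr_mode))
        + abs(math.log10(wd) - math.log10(wd_mode))
        for name, (lr, wd) in best_settings.items()
    }


# Hypothetical best settings for one backbone (made-up numbers).
best = {
    "SGD": (1e-1, 5e-4),
    "Adam": (1e-3, 5e-2),
    "AdamW": (1e-3, 5e-2),
    "LAMB": (1e-3, 5e-2),
}
variation = hyperparameter_variation(best)
for name, dist in variation.items():
    print(f"{name}: {dist:.2f}")
```

Under these assumptions, a backbone whose optimizers all share the mode setting has zero variation everywhere, while an optimizer that needs a very different learning rate or weight decay shows up as a large distance, which is what a tall box or outlier in the boxplot would reflect.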