Asymmetric Duos: Sidekicks Improve Uncertainty

Tim G. Zhou, Evan Shelhamer, Geoff Pleiss

TL;DR

This work tackles the high computational cost of uncertainty-aware inference with large pre-trained models by introducing Asymmetric Duos, which pair a large base model with a smaller sidekick and aggregate their predictions using temperature-weighted logits $f_{\text{Duo}}(X)=f_{\text{large}}(X)\cdot T_{\text{large}}+f_{\text{small}}(X)\cdot T_{\text{small}}$. With the two temperatures tuned on a validation set, the framework improves accuracy, uncertainty quantification, and selective classification metrics at only $10$-$20\%$ extra FLOPs across five image classification benchmarks, with results that are robust to model soups and compatible with transfer-learning workflows. The study provides ablations (unweighted vs. weighted, UQ-only variants), computational-cost analyses via FLOPs balance, and comparisons to deep ensembles, showing that Duos can closely match or exceed ensemble performance with far lower compute. The practical impact is substantial: deployable, flexible, and uncertainty-aware improvements for large fine-tuned vision models without prohibitive training-time costs.

Abstract

The go-to strategy to apply deep networks in settings where uncertainty informs decisions--ensembling multiple training runs with random initializations--is ill-suited for the extremely large-scale models and practical fine-tuning workflows of today. We introduce a new cost-effective strategy for improving the uncertainty quantification and downstream decisions of a large model (e.g. a fine-tuned ViT-B): coupling it with a less accurate but much smaller "sidekick" (e.g. a fine-tuned ResNet-34) with a fraction of the computational cost. We propose aggregating the predictions of this Asymmetric Duo by simple learned weighted averaging. Surprisingly, despite their inherent asymmetry, the sidekick model almost never harms the performance of the larger model. In fact, across five image classification benchmarks and a variety of model architectures and training schemes (including soups), Asymmetric Duos significantly improve accuracy, uncertainty quantification, and selective classification metrics with only ${\sim}10-20\%$ more computation.
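The learned weighted averaging described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it combines the two models' logits as $f_{\text{large}}(X)\cdot T_{\text{large}}+f_{\text{small}}(X)\cdot T_{\text{small}}$ and picks the two temperatures on held-out validation data. The function names, the toy logits, the grid search, and the choice of negative log-likelihood as the tuning objective are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one example's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def duo_logits(large, small, t_large, t_small):
    """Per-class combination: f_large * T_large + f_small * T_small."""
    return [zl * t_large + zs * t_small for zl, zs in zip(large, small)]

def mean_nll(large_batch, small_batch, labels, t_large, t_small):
    """Average negative log-likelihood of the Duo on labeled data."""
    total = 0.0
    for zl, zs, y in zip(large_batch, small_batch, labels):
        probs = softmax(duo_logits(zl, zs, t_large, t_small))
        total -= math.log(max(probs[y], 1e-12))
    return total / len(labels)

def fit_temperatures(large_batch, small_batch, labels, grid):
    """Grid search over (T_large, T_small); the optimizer is an assumption."""
    candidates = [
        (mean_nll(large_batch, small_batch, labels, tl, ts), tl, ts)
        for tl in grid
        for ts in grid
    ]
    _, best_tl, best_ts = min(candidates)
    return best_tl, best_ts

# Hypothetical validation logits for 4 examples and 3 classes.
val_large = [[4.0, 0.5, 0.1], [0.2, 3.0, 0.3], [2.0, 1.8, 0.1], [0.1, 0.4, 2.5]]
val_small = [[1.5, 0.8, 0.2], [0.9, 1.2, 0.4], [0.3, 1.6, 0.2], [0.5, 0.3, 1.1]]
val_labels = [0, 1, 1, 2]
grid = [0.0, 0.25, 0.5, 1.0, 2.0, 4.0]

t_large, t_small = fit_temperatures(val_large, val_small, val_labels, grid)
print("T_large =", t_large, "T_small =", t_small)
```

Because the grid contains the point $(T_{\text{large}}, T_{\text{small}}) = (1, 0)$, which reduces the Duo to the large model alone, the fitted Duo can never score worse than the large model on the validation objective. That is one intuition for why a weak sidekick rarely hurts when the weights are learned.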

Paper Structure

This paper contains 38 sections, 4 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Overview of Asymmetric Duos. (Left) Schematic of Duos vs. single models and deep ensembles. (Right) Gains from Duos where the sidekick adds only 10%--20% more FLOPs depending on the choice of models. Asymmetric Duos improve upon their base models across accuracy, uncertainty quantification, and selective classification metrics on in-distribution (IND) and out-of-distribution (OOD) data (see Sec. \ref{sec:Experiment_Result} for experiment details). We plot relative improvements toward perfect scores; shading marks the standard deviation across the choice of base and sidekick models.
  • Figure 2: Model Size Correlates with Accuracy. Performance improves with higher FLOPs and parameter counts across many models on multiple benchmarks. Duos let us tune the total size: pairing a larger $f_{\mathrm{large}}$ with a smaller $f_{\mathrm{small}}$ adds as little or as much computation as we choose.
  • Figure 3: Class Prediction (Accuracy and F1 $\uparrow$) as a function of FLOPs balance. ($\text{Balance} = 0$ corresponds to the cost of a single $f_{\mathrm{large}}$ model; $1$ corresponds to the cost of an $m=2$ ensemble.) Asymmetric Duos almost always increase accuracy over a single $f_{\mathrm{large}}$, even for Duos that combine $f_{\mathrm{large}}$ with an $f_{\mathrm{small}}$ that is $1/10^\mathrm{th}$ the size. The learned temperature weighting is crucial to achieve this performance; if the predictions of $f_{\mathrm{large}}$ and $f_{\mathrm{small}}$ are averaged with equal weight (Duo: Unweighted), then imbalanced Duos may perform significantly worse than single models.
  • Figure 4: Correctness prediction as measured by AUROC ($\uparrow$), which captures how well uncertainty separates correct from incorrect predictions. In almost all cases, Asymmetric Duos achieve higher AUROC than the corresponding $f_{\mathrm{large}}$ models in isolation, even when using a negligible amount of additional computation. We confirm that this increase cannot be attributed to accuracy improvements alone: our ablation (Duo: UQ only), which uses the Duo's uncertainty measure to separate correct from incorrect $f_{\mathrm{large}}$ class predictions, produces a comparable AUROC increase.
  • Figure 5: Selective classification performance as measured by AURC ($\downarrow$), which averages the error across classification coverage levels/abstaining rates. Duos significantly improve this metric while adding as little as $10\%$ additional computation. As with our correctness prediction results (Figure 4), our UQ only ablation confirms that these improvements cannot be solely attributed to increases in accuracy.
  • ...and 13 more figures
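
The quantities named in the captions above (FLOPs balance, AUROC for correctness prediction, and AURC for selective classification) can be made concrete with a short sketch. This is an illustration under stated assumptions rather than the paper's evaluation code: the balance is taken to be the ratio of sidekick FLOPs to base-model FLOPs, which matches the two endpoints given in the Figure 3 caption; AUROC and AURC use their standard empirical estimators; and all function names and numbers are hypothetical.

```python
def flops_balance(flops_large, flops_small):
    """Assumed definition: 0 means f_large alone, 1 means two equal-cost models."""
    return flops_small / flops_large

def auroc_correctness(confidences, correct):
    """Chance that a correct prediction gets higher confidence than an incorrect one."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    # Ties between a correct and an incorrect prediction count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def aurc(confidences, correct):
    """Mean selective risk (error rate among kept examples) over all coverage levels."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    risks = []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)  # risk when only the k most confident are kept
    return sum(risks) / len(risks)

# Hypothetical confidences (e.g. max softmax probability) and correctness flags.
confidences = [0.9, 0.8, 0.6, 0.55]
correct = [True, True, False, True]

print("balance:", flops_balance(10.0, 1.5))
print("AUROC:", auroc_correctness(confidences, correct))
print("AURC:", aurc(confidences, correct))
```

The UQ-only ablation mentioned in the captions fits the same interface: pass the Duo's confidences together with the correctness flags of $f_{\mathrm{large}}$'s own class predictions, so any change in AUROC or AURC reflects the uncertainty estimate rather than a change in accuracy.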