Revisiting Mixout: An Overlooked Path to Robust Finetuning

Masih Aminbeidokhti, Heitor Rapela Medeiros, Eric Granger, Marco Pedersoli

TL;DR

This work addresses the robustness of finetuned foundation models under distribution shift by reinterpreting Mixout as an implicit weight-space ensemble and identifying three key levers: the masking anchor, resampling frequency, and mask sparsity. Guided by this analysis, the authors propose GMixout, which (i) replaces the fixed pretrained anchor with an exponential moving-average snapshot and (ii) exposes a resampling-frequency hyperparameter, both implemented with sparse kernels for efficiency. Empirical results across covariate-shift, corruption, and class-imbalance benchmarks (including ImageNet, ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C) show that GMixout consistently improves out-of-distribution (OOD) robustness while maintaining or improving in-distribution (ID) accuracy, often surpassing Model Soups and parameter-efficient finetuning (PEFT) baselines, with no inference-time cost. The approach offers a practical, scalable robustness tool for finetuning large vision models on consumer hardware, combining ensemble-like benefits with the efficiency of PEFT.

Abstract

Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the masking anchor, resampling frequency, and mask sparsity. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates the masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. In experiments on benchmarks covering covariate shift, corruption, and class imbalance (ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C), GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
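
To make the three levers concrete, here is a minimal toy sketch of a GMixout-style update on a flat list of weights. It is not the authors' implementation. The class name `GMixoutSketch` and the parameters `mask_prob`, `ema_decay`, and `resample_every` are hypothetical names chosen for illustration. The sketch also assumes that the anchor is refreshed as an exponential moving average of the current weights each time the mask is resampled; the abstract does not specify that schedule. The rescaling used by the original Mixout and the sparse kernels mentioned above are omitted.

```python
import random


class GMixoutSketch:
    """Toy GMixout-style update on a flat list of weights (illustrative only)."""

    def __init__(self, pretrained, mask_prob=0.9, ema_decay=0.99,
                 resample_every=10, seed=0):
        self.weights = list(pretrained)       # finetuned weights
        self.anchor = list(pretrained)        # masking anchor, starts at the pretrained weights
        self.mask_prob = mask_prob            # mask sparsity: chance an entry is swapped for the anchor
        self.ema_decay = ema_decay            # assumed EMA coefficient for the anchor
        self.resample_every = resample_every  # resampling frequency, in optimizer steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.mask = self._sample_mask()

    def _sample_mask(self):
        # True means "use the anchor value for this entry".
        return [self.rng.random() < self.mask_prob for _ in self.weights]

    def effective_weights(self):
        # Weights used in the forward pass: anchor where masked, finetuned elsewhere.
        return [a if m else w
                for w, a, m in zip(self.weights, self.anchor, self.mask)]

    def step(self, grads, lr=0.1):
        # Only unmasked entries are trained, so each step touches a small fraction.
        for i, (g, m) in enumerate(zip(grads, self.mask)):
            if not m:
                self.weights[i] -= lr * g
        self.steps += 1
        if self.steps % self.resample_every == 0:
            # Assumption: refresh the anchor by EMA, then draw a new mask (a new subnetwork).
            d = self.ema_decay
            self.anchor = [d * a + (1 - d) * w
                           for a, w in zip(self.anchor, self.weights)]
            self.mask = self._sample_mask()


if __name__ == "__main__":
    # Toy quadratic objective: pull the effective weights toward a target vector.
    target = [1.0] * 8
    model = GMixoutSketch([0.0] * 8, mask_prob=0.75, resample_every=5)
    for _ in range(200):
        eff = model.effective_weights()
        model.step([e - t for e, t in zip(eff, target)])
    print(model.effective_weights())
```

In this sketch, setting `mask_prob` to 0 recovers plain gradient descent, while a value of 1 freezes the model at its anchor. A larger `resample_every` keeps each masked subnetwork active for longer before the next one is drawn.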

Paper Structure

This paper contains 24 sections, 10 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: OOD-ID accuracy trade-off on the DomainNet dataset. Models are trained on Sketch (ID) and evaluated on Real, Painting, and Clipart (OOD) data. (a) Each point represents a method, with ID accuracy on the x-axis and OOD accuracy on the y-axis. The ideal method should improve over the zero-shot baseline along both axes. (b) Mixout regulates the ID–OOD trade-off through mask sparsity. (c, d) GMixout extends Mixout with two mechanisms: (i) an EMA coefficient that updates the masking anchor, and (ii) a resampling frequency that determines the number of uncovered subnetworks during optimization (shown in parentheses in (d)). By controlling the variance and covariance of the expected test error, these hyperparameters significantly enhance OOD performance while maintaining competitiveness on ID. Although these additional hyperparameters can significantly affect the OOD trade-off, they remain stable across datasets, and all results are reported using the same settings across experiments (indicated by red circles in the plots).
  • Figure 2: OOD-ID accuracy on ImageNet with varying training set sizes.
  • Figure 3: OOD-ID trade-off given variable $\lambda$ (left) and $k$ (right) on DomainNet Real.
  • Figure 4: OOD-ID trade-off given the variable trainable parameters budget for PEFT methods.
  • Figure 5: Average transfer accuracy on five out-of-task datasets after finetuning the methods on ImageNet-1k.