Revisiting Mixout: An Overlooked Path to Robust Finetuning

Masih Aminbeidokhti; Heitor Rapela Medeiros; Eric Granger; Marco Pedersoli

TL;DR

This work addresses the robustness of finetuned foundation models under distribution shift by reinterpreting Mixout as an implicit weight-space ensemble and identifying three key levers: the masking anchor, resampling frequency, and mask sparsity. Guided by this analysis, the authors propose GMixout, which (i) replaces the fixed pretrained anchor with an exponential moving-average snapshot and (ii) exposes a resampling-frequency hyperparameter; the method is implemented with sparse kernels for efficiency. Empirical results across covariate-shift, corruption, and class-imbalance benchmarks (ImageNet, ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C) show that GMixout consistently improves out-of-distribution (OOD) robustness while maintaining or improving in-distribution (ID) accuracy, often surpassing Model Soups and strong parameter-efficient finetuning (PEFT) baselines, with no inference-time cost. The approach offers a practical, scalable robustness tool for finetuning large vision models on consumer hardware, combining ensemble-like benefits with the efficiency of PEFT.

Abstract

Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the masking anchor, resampling frequency, and mask sparsity. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates the masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. In experiments on benchmarks covering covariate shift, corruption, and class imbalance (ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C), GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
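
The three levers named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the toy gradient step, the choice to update the moving-average anchor at every step, and the omission of any rescaling of the mixed weights are all assumptions made for illustration. It shows a mask that swaps a fraction `p` of finetuned weights for an anchor, a mask that is resampled only every `resample_every` steps, and an anchor that tracks the weights as an exponential moving average.

```python
import numpy as np


def mix_with_anchor(weights, anchor, mask):
    """Where mask is True use the anchor value; elsewhere keep the finetuned weight."""
    return np.where(mask, anchor, weights)


def sample_mask(rng, shape, p):
    """Boolean mask; each entry is True (replaced by the anchor) with probability p."""
    return rng.random(shape) < p


def update_ema(anchor, weights, decay):
    """Exponential moving average of the weights (hypothetical update rule)."""
    return decay * anchor + (1.0 - decay) * weights


def run_sketch(steps=6, resample_every=3, p=0.9, decay=0.5, seed=0):
    """Toy loop; every hyperparameter value here is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    pretrained = np.zeros(8)
    weights = pretrained.copy()
    anchor = pretrained.copy()  # anchor starts at the pretrained weights
    mask = None
    resamples = 0
    for step in range(steps):
        # Resampling frequency: draw a fresh mask only every `resample_every` steps.
        if step % resample_every == 0:
            mask = sample_mask(rng, weights.shape, p)
            resamples += 1
        effective = mix_with_anchor(weights, anchor, mask)
        # Stand-in for a gradient step: pull the unmasked weights toward 1.0.
        grad = effective - 1.0
        weights = weights - 0.1 * grad * (~mask)
        # Assumption: the anchor is refreshed after every step.
        anchor = update_ema(anchor, weights, decay)
    return weights, anchor, resamples
```

With a high `p`, only the small unmasked fraction of entries receives updates at any step, which is the kind of sparsity a sparse-kernel implementation could exploit.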
