High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

Masih Aminbeidokhti; Heitor Rapela Medeiros; Srikanth Muralidharan; Eric Granger; Marco Pedersoli

TL;DR

This paper tackles domain generalization for vision models under distribution shift, aiming for ensemble-level robustness without training and storing multiple models. It introduces High-rate Mixout, a stochastic regularizer that aggressively swaps fine-tuned weights with their pre-trained counterparts during training, using masking rates of about $p \approx 0.9$ for ViTs and $p \approx 0.8$ for ResNets. For CNNs, it extends the approach with structured masking that swaps entire convolutional kernels. The authors establish a weight-space equivalence to ensembling and show that deterministic weight-scaling inference approximates the ensemble's behavior, so a single training run can match ensemble performance on five DomainBed benchmarks (PACS, VLCS, OfficeHome, TerraIncognita, DomainNet). Empirically, High-rate Mixout delivers out-of-domain accuracy comparable to ensembles while reducing gradient computation by up to 45% and gradient memory by up to 90%, with CNNs benefiting from kernel-level masking. Overall, the method narrows the gap between performance and efficiency in domain generalization for large-scale pre-trained models.

Abstract

Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
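
The swapping step described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names `mixout_swap` and `expected_weights` are hypothetical, weights are plain Python lists rather than tensors, and no rescaling of the mixed weights is applied (whether and how the paper rescales is not stated here). The sketch only shows that each weight is replaced by its pre-trained value with probability `p`, and that the average of the mixed weights over masks is the interpolation `p * pretrained + (1 - p) * finetuned`.

```python
import random


def mixout_swap(finetuned, pretrained, p, rng):
    """Swap each fine-tuned weight with its pre-trained counterpart
    independently with probability p (hypothetical helper, no rescaling).

    Returns the mixed weights and the boolean mask (True = swapped).
    """
    if len(finetuned) != len(pretrained):
        raise ValueError("weight vectors must have the same length")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    mixed, mask = [], []
    for w, u in zip(finetuned, pretrained):
        swap = rng.random() < p  # True with probability p
        mask.append(swap)
        mixed.append(u if swap else w)
    return mixed, mask


def expected_weights(finetuned, pretrained, p):
    """Average of the mixed weights over all masks:
    p * pretrained + (1 - p) * finetuned, element-wise."""
    return [p * u + (1.0 - p) * w for w, u in zip(finetuned, pretrained)]
```

Under the assumption that swapped positions hold fixed pre-trained values and therefore need no gradient, only the unswapped fraction of weights (about `1 - p`, so roughly 10% at `p = 0.9`) would require gradient storage in a given step, which is in line with the memory savings the abstract reports.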
