High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

Masih Aminbeidokhti, Heitor Rapela Medeiros, Srikanth Muralidharan, Eric Granger, Marco Pedersoli

TL;DR

This paper tackles domain generalization for vision models under distribution shifts, aiming to achieve ensemble-level robustness without training and storing multiple models. It introduces High-rate Mixout, a stochastic regularizer that aggressively swaps fine-tuned weights with pre-trained weights during training, with masking rates around $p \approx 0.9$ for ViTs and $p \approx 0.8$ for ResNets, and it extends the approach with structured masking by swapping entire convolutional kernels in CNNs. The authors establish a weight-space equivalence to ensembling and show that a deterministic weight-scaling inference approximates the ensemble behavior, enabling single-run training to match ensemble performance on five DomainBed benchmarks (PACS, VLCS, OfficeHome, TerraIncognita, DomainNet). Empirical results demonstrate that High-rate Mixout delivers out-of-domain accuracy comparable to ensembles while reducing gradient computation by up to 45% and gradient memory by up to 90%, with CNNs benefiting from kernel-level masking. Overall, the method narrows the gap between performance and efficiency in domain generalization for large-scale pre-trained models.

Abstract

Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
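The weight-swapping step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `mixout`, the `structured` flag, and the choice to share one draw per output filter (axis 0) are assumptions made for illustration, and the rescaling by $1/(1-p)$ follows the form commonly given for the original Mixout regularizer, which may differ in detail from the paper's.

```python
import numpy as np

def mixout(w_finetuned, w_pretrained, p, rng, structured=False):
    """Swap fine-tuned weights for pre-trained ones with probability p.

    Hypothetical helper for illustration only.
    structured=False: one Bernoulli draw per individual weight.
    structured=True:  one draw per output filter (axis 0), an assumed
                      reading of "swapping entire convolutional kernels".
    Returns the mixed weights and the boolean swap mask.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    if structured:
        # Shape (out_channels, 1, 1, ...) so the mask broadcasts per filter.
        mask_shape = (w_finetuned.shape[0],) + (1,) * (w_finetuned.ndim - 1)
    else:
        mask_shape = w_finetuned.shape
    swap = rng.random(mask_shape) < p  # True -> use the pre-trained value
    mixed = np.where(swap, w_pretrained, w_finetuned)
    # Assumed rescaling: makes the expectation over masks equal w_finetuned.
    return (mixed - p * w_pretrained) / (1.0 - p), swap

rng = np.random.default_rng(0)
w_pre = rng.normal(size=(16, 3, 3, 3))               # toy conv weights
w_ft = w_pre + 0.01 * rng.normal(size=w_pre.shape)   # toy fine-tuned weights
w_mix, swap = mixout(w_ft, w_pre, p=0.8, rng=rng, structured=True)
print("filters swapped back to pre-trained:", int(swap.sum()), "of", swap.size)
```

In this sketch, entries where `swap` is true do not depend on `w_finetuned`, so on average only a fraction of about $1 - p$ of the weights receives a nonzero gradient in a given step. That is one plausible reading of why high masking rates reduce gradient computation and memory.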

Paper Structure

This paper contains 27 sections, 14 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Comparison between in-domain (left) and out-of-domain (right) accuracy of Mixout and Dropout on the OfficeHome dataset with the ViT-S/16 architecture. While the two approaches reach a similar best in-domain accuracy, out of domain the ability of Mixout to preserve the knowledge of the pre-trained model leads to accuracies comparable to ensemble models with a single training run. We show Dropout results for a probability of 0.1, as performance rapidly declines to zero for higher probabilities in both cases. Mixout at a probability of 0.0 is equivalent to ERM.
  • Figure 2: Difference between structured (a) and unstructured (b) Mixout. The top shows convolutional filters, while the bottom shows neurons.
  • Figure 3: Performance versus computational and memory cost for the backward pass across different architectures and methods. The x-axis is shown on a logarithmic scale. Each bubble area is proportional to the gradient memory usage during the training of a model. High-rate Mixout achieves competitive performance compared to ensemble-based methods while requiring significantly less computation and memory during training.
  • Figure 4: Comparison between in-domain and out-of-domain accuracy of Mixout and Dropout/DropFilter on the OfficeHome dataset with the ResNet50 architecture. Unlike with ViT-S/16, Mixout with structured masking is better for both in-domain and out-of-domain performance. We show Dropout and DropFilter results for probabilities of 0.1 and 0.05, respectively, as performance rapidly declines to zero for higher probabilities in both cases. Mixout at a probability of 0.0 is equivalent to ERM.
  • Figure 5: Test accuracy as a function of the number of MC samples $k$ averaged at inference time (blue dashed curves) for a model trained with High-rate Mixout and evaluated on the Art domain from OfficeHome. For reference, the computationally cheaper single-pass weight scaling approximation is indicated by red horizontal lines.
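The comparison in Figure 5, averaging $k$ Monte Carlo samples versus a single deterministic pass, can be sketched on a toy linear classifier. Everything here is an assumption for illustration: the helper names, the one-layer softmax model, and the rescaled Mixout form, under which the mean of the sampled weights is the fine-tuned weights themselves, so the single-pass approximation simply uses them directly. The paper's exact weight-scaling rule may differ.

```python
import numpy as np

def sample_mixout(w_ft, w_pre, p, rng):
    # One random Mixout draw (assumed rescaled form, unstructured mask).
    swap = rng.random(w_ft.shape) < p
    return (np.where(swap, w_pre, w_ft) - p * w_pre) / (1.0 - p)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_predict(x, w_ft, w_pre, p, k, rng):
    # Average class probabilities over k independently sampled weight masks.
    probs = [softmax(x @ sample_mixout(w_ft, w_pre, p, rng)) for _ in range(k)]
    return np.mean(probs, axis=0)

def weight_scaling_predict(x, w_ft):
    # Single pass with the expected weights, which equal w_ft in this sketch.
    return softmax(x @ w_ft)

rng = np.random.default_rng(0)
w_pre = rng.normal(size=(4, 3))                      # toy pre-trained layer
w_ft = w_pre + 0.02 * rng.normal(size=w_pre.shape)   # toy fine-tuned layer
x = rng.normal(size=(5, 4))                          # five toy inputs

single = weight_scaling_predict(x, w_ft)
for k in (1, 10, 100, 1000):
    gap = np.abs(mc_predict(x, w_ft, w_pre, 0.9, k, rng) - single).max()
    print(f"k={k:5d}  max |MC - single pass| = {gap:.4f}")
```

The single pass costs one forward evaluation, while the Monte Carlo estimate costs $k$. For a nonlinear model the two agree only approximately, which is why the figure reports them separately.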