Domain Generalization Using Large Pretrained Models with Mixture-of-Adapters

Gyuseong Lee; Wooseok Jang; Jinhyeon Kim; Jaewoo Jung; Seungryong Kim

TL;DR

This work addresses domain generalization (DG) by leveraging large pretrained vision models with parameter-efficient fine-tuning (PEFT) to improve robustness on unseen domains. It introduces Mixture-of-Adapters (MoA), an adapter-based mixture-of-experts framework in which learnable routers direct inputs to adapters of varying capacity, so that the strength of regularization can be adjusted without full fine-tuning. Through loss-landscape and Hessian analyses, the authors show that PEFT yields flatter loss surfaces with smaller curvature, which correlates with better DG performance; KAdaptation (KA) often provides the strongest gains, and MoA improves results further. On standard DG benchmarks, the approach achieves state-of-the-art results at modest training cost, illustrating the practical value of combining PEFT, MoA, and large-scale pretrained models for robust domain generalization.
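To make the routing idea above concrete, the following is a minimal NumPy sketch of one mixture-of-adapters layer. It is an illustration under stated assumptions, not the authors' implementation: the adapters are LoRA-style low-rank updates, the router is a plain linear layer with top-k softmax gating, and the class names, ranks `(2, 4, 8)`, and initialization scales are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    # Numerically stable softmax; entries set to -inf receive exactly zero weight.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class LoRAAdapter:
    """Low-rank update x @ A @ B. B starts at zero, so the adapter is initially a no-op."""
    def __init__(self, dim, rank):
        self.A = rng.normal(scale=0.02, size=(dim, rank))
        self.B = np.zeros((rank, dim))

    def __call__(self, x):
        return x @ self.A @ self.B

class MixtureOfAdapters:
    """Frozen weight W0 plus a routed sum of adapters with different ranks (assumed design)."""
    def __init__(self, dim, ranks=(2, 4, 8), top_k=1):
        self.W0 = rng.normal(scale=0.02, size=(dim, dim))   # stands in for a frozen pretrained weight
        self.adapters = [LoRAAdapter(dim, r) for r in ranks]
        self.router = rng.normal(scale=0.02, size=(dim, len(ranks)))  # linear router
        self.top_k = top_k

    def __call__(self, x):
        # x has shape (tokens, dim); each token is routed independently.
        logits = x @ self.router
        top = np.argsort(-logits, axis=-1)[:, : self.top_k]
        mask = np.zeros(logits.shape, dtype=bool)
        np.put_along_axis(mask, top, True, axis=-1)
        gates = softmax(np.where(mask, logits, -np.inf))     # zero outside the top-k experts
        out = x @ self.W0
        for i, adapter in enumerate(self.adapters):
            out = out + gates[:, i : i + 1] * adapter(x)
        return out, gates

layer = MixtureOfAdapters(dim=16)
x = rng.normal(size=(5, 16))
out, gates = layer(x)
print(out.shape, gates.sum(axis=-1))
```

Because every `B` matrix is zero at initialization, the layer starts out identical to the frozen projection `x @ W0`; training would then update only the adapters and the router.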

Abstract

Learning robust vision models that perform well in out-of-distribution (OOD) situations is an important task for model deployment in real-world settings. Despite extensive research in this field, many proposed methods have only shown minor performance improvements compared to the simplest empirical risk minimization (ERM) approach, which was evaluated on a benchmark with a limited hyperparameter search space. Our focus in this study is on leveraging the knowledge of large pretrained models to improve handling of OOD scenarios and tackle domain generalization problems. However, prior research has revealed that naively fine-tuning a large pretrained model can impair OOD robustness. Thus, we employ parameter-efficient fine-tuning (PEFT) techniques to effectively preserve OOD robustness while working with large models. Our extensive experiments and analysis confirm that the most effective approaches involve ensembling diverse models and increasing the scale of pretraining. As a result, we achieve state-of-the-art performance in domain generalization tasks. Our code and project page are available at: https://cvlab-kaist.github.io/MoA
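As a rough illustration of why PEFT is cheap compared with full fine-tuning, the sketch below counts trainable parameters for a single square projection when it is fully fine-tuned versus adapted with a LoRA-style low-rank update. The hidden size 768 matches ViT-B/16; the rank of 4 and the function name are assumptions chosen for the example, not values taken from the paper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full weight matrix vs. low-rank factors A (d_in x r) and B (r x d_out)."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

full, lora = lora_param_counts(768, 768, rank=4)
print(full, lora, f"{lora / full:.2%}")  # 589824 6144 1.04%
```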

Paper Structure (31 sections, 3 equations, 12 figures, 6 tables)

Introduction
Related Work
Domain Generalization
Parameter efficient fine-tuning
Mixture-of-Experts
Preliminaries
Domain Generalization
Parameter Efficient Adapter
The Effectiveness of Parameter-Efficient Fine-Tuning in Domain Generalization
Loss Landscapes
Maximum Hessian Eigenvalue Spectra
Parameter-Efficient Adapter for Domain Generalization
Mixture-of-Adapters (MoA)
Experimental Results
Experimental Setting
...and 16 more sections

Figures (12)

Figure 1: Results on domain generalization benchmarks with varying numbers of trainable parameters in ViT-B/16 [dosovitskiy2020image], pretrained on a private OpenAI dataset [radford2021learning]. The y-axis indicates accuracy. We use linear probing (denoted as Linear), bias tuning in the attention layer (Bias (MSA)), bias tuning in both the attention and MLP layers (Bias (MSA+MLP)), and full fine-tuning to illustrate how accuracy changes with the number of trainable parameters when applying PEFT methods to large models. OH, TI, and DN denote OfficeHome [venkateswara2017deep], TerraIncognita [beery2018recognition], and DomainNet [peng2019moment], respectively.
Figure 2: Flatness comparison of loss surfaces from models trained with full fine-tuning, LoRA, KAdaptation, and KAdaptation with Mixture-of-Adapters (denoted as KMoA) on the PACS dataset [li2017deeper]. All visualizations are computed on test environment 0 (the Art domain). The x and y axes (plane) in each figure represent the perturbation directions of the model weights, and the z axis (height) represents the change in loss value under the weight perturbation.
Figure 3: Comparison of the maximum Hessian eigenvalue spectra of models trained with full fine-tuning, LoRA, KAdaptation, and MoA with KAdaptation on the PACS dataset [li2017deeper]. The x-axis represents the maximum Hessian eigenvalue, while the y-axis represents its density. KA and KMoA refer to KAdaptation and Mixture-of-Adapters with KAdaptation, respectively. TE0 to TE3 denote the test domains of the domain generalization dataset; for example, in the PACS dataset, TE0 to TE3 correspond to art_painting, cartoon, photo, and sketch, respectively. In all test environments, KMoA exhibits the most zero-concentrated eigenvalue spectrum. KA and LoRA also show a smaller maximum Hessian eigenvalue distribution than full fine-tuning (w/o LoRA). Note that the x-axis is highly magnified, which makes w/ LoRA and w/o LoRA appear almost flat. When the x-scale is expanded, however, both w/ LoRA and w/o LoRA exhibit a zero-concentrated shape similar to w/ KA and w/ KMoA.
Figure 4: Architecture of the proposed Mixture-of-Adapters (MoA). $\mathbf{W}_0$, $\mathbf{x}_\mathrm{in}$, and $\mathbf{x}_\mathrm{out}$ denote the original pretrained weight, the input tokens, and the output tokens in multi-head self-attention (MHSA), respectively. The Adapter in (a) can refer to any adapter-based PEFT method, such as LoRA, Compacter, or KAdaptation. Additionally, the Router in (b) can be a linear or cosine router, as commonly used in Mixture-of-Experts methods.
Figure 5: Performance comparison between the original (not fine-tuned) CLIP and the ImageNet fine-tuned CLIP from timm [rw2019timm] (denoted as CLIP-FT) using different fine-tuning strategies. Full, Att., LoRA, and KA denote full fine-tuning, attention-only tuning [touvron2022three], LoRA, and KAdaptation, respectively.
...and 7 more figures
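
The maximum Hessian eigenvalue referred to in the Figure 3 caption is commonly estimated with power iteration on Hessian-vector products. The sketch below shows one way this could be done; it is an assumption about a typical procedure, not the paper's confirmed method. The toy quadratic losses, the finite-difference Hessian-vector product, and all function names are hypothetical, and power iteration returns the largest-magnitude eigenvalue, which equals the maximum eigenvalue only when the Hessian is positive semi-definite, as it is in this toy case.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-4):
    """Central finite-difference approximation of the Hessian-vector product H(w) @ v."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def max_hessian_eigenvalue(grad_fn, w, iters=200, seed=0):
    """Power iteration using only gradient evaluations; returns the dominant eigenvalue estimate."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = float(v @ hv)            # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam

# Toy losses 0.5 * w^T H w, whose gradient is H @ w and whose Hessian is H.
H_sharp = np.diag([10.0, 1.0, 0.1])    # stands in for a sharp minimum
H_flat = np.diag([0.5, 0.2, 0.1])      # stands in for a flat minimum
w0 = np.ones(3)

sharp = max_hessian_eigenvalue(lambda w: H_sharp @ w, w0)
flat = max_hessian_eigenvalue(lambda w: H_flat @ w, w0)
print(f"sharp: {sharp:.3f}, flat: {flat:.3f}")
```

A smaller estimate indicates lower curvature around the current weights, which is the sense in which the caption describes a spectrum as more zero-concentrated.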
