Decoupled Training: Return of Frustratingly Easy Multi-Domain Learning

Ximei Wang; Junwei Pan; Xingzhuo Guo; Dapeng Liu; Jie Jiang

TL;DR

Multi-domain learning (MDL) must contend with two challenges: dataset bias across domains, and domain domination, in which data-rich head domains dominate training. The authors propose Decoupled Training (D-Train), a tri-phase general-to-specific strategy built on a shared-bottom backbone: (1) pre-train on all domains to learn a root model $(\psi_0, h_0)$, (2) post-train by splitting into domain-specific heads while sharing the backbone, and (3) fine-tune the heads with the backbone fixed to obtain independent domain-specific heads $\widehat{h}_t$. Across Office-Home, DomainNet, FMoW, and Amazon, D-Train outperforms domain-alignment and mixture-of-experts baselines, with consistent gains on both average and worst-domain metrics, and it can be plugged into existing MDL methods. An online deployment on the Tencent DSP shows gains in cost and GMV, underscoring practical impact. The method demonstrates that a decoupled, hyperparameter-free, stage-wise optimization can mitigate the seesaw effect in MDL, which improves scalability and deployability.
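One way to write the three phases as optimization problems is sketched below. This is a reading of the summary above, not the paper's own equations: the per-domain datasets $\mathcal{D}_t$, the loss $\ell$, the intermediate symbols $\tilde\psi$ and $\tilde h_t$, and the unweighted sum over samples are all assumed notation.

$$
\begin{aligned}
\text{Pre-train:}\quad & (\psi_0, h_0) = \arg\min_{\psi,\, h} \sum_{t=1}^{T} \sum_{(x,y)\in\mathcal{D}_t} \ell\big(h(\psi(x)),\, y\big) \\
\text{Post-train:}\quad & (\tilde\psi, \tilde h_1, \ldots, \tilde h_T) = \arg\min_{\psi,\, h_1,\ldots,h_T} \sum_{t=1}^{T} \sum_{(x,y)\in\mathcal{D}_t} \ell\big(h_t(\psi(x)),\, y\big), \quad \text{initialized at } \psi = \psi_0,\ h_t = h_0 \\
\text{Fine-tune:}\quad & \widehat h_t = \arg\min_{h_t} \sum_{(x,y)\in\mathcal{D}_t} \ell\big(h_t(\tilde\psi(x)),\, y\big), \quad \text{initialized at } h_t = \tilde h_t,\ \text{for each } t \text{ separately}
\end{aligned}
$$

In the last phase each problem involves only one domain's data and one head, which is the sense in which the heads are trained independently once the backbone is fixed.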

Abstract

Multi-domain learning (MDL) aims to train a model with minimal average risk across multiple overlapping but non-identical domains. To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and increasing computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training (D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multi-heads, and finally fine-tunes the heads by fixing the backbone, enabling decoupled training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations on various datasets, from standard benchmarks to applications of satellite imagery and recommender systems.
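The tri-phase schedule described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the toy 1-D regression domains, the scalar "backbone" `a`, the linear heads `[w, b]`, the helper names `make_domain`, `mse`, and `train`, and the step count and learning rate. The second domain is given ten times fewer samples to mimic a tail domain.

```python
def make_domain(n, slope, offset):
    """Hypothetical toy domain: y = slope * x + offset on a grid in [-1, 1]."""
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]
    return [(x, slope * x + offset) for x in xs]


def mse(a, head, data):
    """Mean squared error of the model head(backbone(x)) = w * (a * x) + b."""
    w, b = head
    return sum((w * a * x + b - y) ** 2 for x, y in data) / len(data)


def train(a, heads, domains, head_idx, steps=200, lr=0.1, train_backbone=True):
    """Full-batch gradient descent on half the mean squared error, pooled
    over all samples in `domains`. head_idx[i] selects which head serves
    domains[i], so a shared head is simply the same index repeated."""
    heads = [list(h) for h in heads]          # copy so callers keep their heads
    n = sum(len(d) for d in domains)
    for _ in range(steps):
        grad_a = 0.0
        grad_h = [[0.0, 0.0] for _ in heads]
        for data, k in zip(domains, head_idx):
            w, b = heads[k]
            for x, y in data:
                err = w * a * x + b - y
                grad_a += err * w * x / n     # d/da
                grad_h[k][0] += err * a * x / n  # d/dw
                grad_h[k][1] += err / n          # d/db
        if train_backbone:                    # skipped when the backbone is fixed
            a -= lr * grad_a
        for h, g in zip(heads, grad_h):
            h[0] -= lr * g[0]
            h[1] -= lr * g[1]
    return a, heads


# A head domain (200 samples) and a tail domain (20 samples).
domains = [make_domain(200, 2.0, 1.0), make_domain(20, 1.0, -1.0)]

# Phase 1, pre-train: one shared head on all domains gives the root model.
a0, (h0,) = train(1.0, [[0.0, 0.0]], domains, head_idx=[0, 0])

# Phase 2, post-train: split into one head per domain, backbone still shared.
a1, heads1 = train(a0, [h0, h0], domains, head_idx=[0, 1])

# Phase 3, fine-tune: backbone fixed, each head trained on its own domain only.
final_heads = [
    train(a1, [h], [d], head_idx=[0], train_backbone=False)[1][0]
    for h, d in zip(heads1, domains)
]

for t, d in enumerate(domains):
    print(f"domain {t}: pre-train {mse(a0, h0, d):.4f}  "
          f"post-train {mse(a1, heads1[t], d):.4f}  "
          f"fine-tune {mse(a1, final_heads[t], d):.4f}")
```

In this toy setup the pooled gradient in the first two phases is averaged over all samples, so the tail domain's head receives only a small share of each update. In the last phase each head sees only its own domain, so its updates no longer depend on the other domain's sample count.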

Paper Structure

This paper contains 24 sections, 4 equations, 6 figures, and 5 tables.

Introduction
Related Work
Seeking Commonalities
Reserving Differences
Approach
Pre-train: Warm Up a Root Model
Post-Train: Split Into Multi-Heads
Fine-tune: Decouple-Train for Independence
Why Does D-Train Work?
Experiments
Standard Benchmarks
Low-data Regime: Office-Home
Large-Scale Dataset: DomainNet
Applications of Satellite Imagery
Applications of Recommender System
...and 9 more sections

Figures (6)

Figure 1: (a)-(b): Review examples from a recommender system benchmark named Amazon with various styles and keywords (shown by different colors), as well as visual examples from a computer vision dataset named Office-Home with various appearances and backgrounds, reveal the main challenge of dataset bias. (c)-(d): The distribution of sample number across domains is naturally imbalanced or even long-tailed, indicating another major challenge of domain domination.
Figure 2: (a): Seeking Commonalities by aligning distributions across domains to reduce domain gap. (b): Reserving Differences by implementing domain-specific towers, gates, and even experts.
Figure 3: (a): Explanations of the phases of D-Train. $\psi$ denotes the feature extractor; $h$ denotes the shared head in the pre-training phase; $\{h_1, h_2, \ldots, h_T\}$ denote the domain-specific heads in the next two phases. During the fine-tuning phase, the parameters of the feature extractor are fixed. (b): Training curves for each phase of D-Train on the domains Clipart, Painting, Real, and Sketch (dotted lines), with the average accuracy over domains shown as a solid line.
Figure 4: Visualization of domain domination, where $h_0$ is the domain-agnostic head at the pre-training phase. $\| \cdot \|$ denotes the Euclidean norm of the parameter update.
Figure 5: Decision boundaries on Two-Moon, where numbers in the legend indicate the accuracy of each domain.
...and 1 more figure
