SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities

Yanis Lalou, Théo Gnassounou, Antoine Collas, Antoine de Mathelin, Oleksii Kachaiev, Ambroise Odonnat, Alexandre Gramfort, Thomas Moreau, Rémi Flamary

TL;DR

SKADA-Bench tackles realistic unsupervised domain adaptation evaluation by combining a nested cross-validation framework with diverse, multimodal datasets (simulated and real) and a broad set of shallow and deep DA methods. It emphasizes unsupervised model selection scorers (e.g., CircV, IW, MixVal) and analyzes how scorer choice impacts reported gains, revealing that many methods are sensitive to hyperparameter tuning and validation strategy. The benchmark shows simple, robust DA approaches (LinOT, CORAL, JPCA, SA) often outperform more complex mappings, though deep DA can excel on computer vision tasks with modality-specific tuning. By providing open-source tooling and a scalable evaluation protocol, SKADA-Bench offers a practical, extensible foundation for comparing DA methods in real-world, heterogeneous settings.

Abstract

Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With SKADA-Bench, we propose a framework to evaluate DA methods on diverse modalities, beyond the computer vision tasks that have been largely explored in the literature. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. SKADA-Bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without re-evaluating competitors. SKADA-Bench is available on GitHub at https://github.com/scikit-adaptation/skada-bench.
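
The nested cross-validation with unsupervised model selection mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SKADA-Bench implementation: the function names (`shuffle_split`, `nested_select`), the single outer split, the split ratio, and the `fit` / `unsup_score` callables are all hypothetical. The point it shows is that the inner loop picks a hyperparameter without ever seeing target labels, and that the outer test indices stay untouched until the final evaluation.

```python
import random


def shuffle_split(indices, test_ratio, rng):
    """Randomly split a collection of indices into (train, test) lists."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(test_ratio * len(shuffled))))
    return shuffled[n_test:], shuffled[:n_test]


def nested_select(n_source, n_target, grid, fit, unsup_score,
                  n_inner=3, test_ratio=0.2, seed=0):
    """Select a hyperparameter with an inner loop, keep an outer test set.

    fit(param, src_idx, tgt_idx) -> model
        Hypothetical callable; it may use source labels but no target labels.
    unsup_score(model, src_idx, tgt_idx) -> float
        Hypothetical unsupervised scorer; higher is assumed to be better.
    """
    rng = random.Random(seed)

    # Outer split: the test parts are reserved for the final evaluation.
    src_train, src_test = shuffle_split(range(n_source), test_ratio, rng)
    tgt_train, tgt_test = shuffle_split(range(n_target), test_ratio, rng)

    # Inner splits are drawn once so every candidate sees the same folds.
    inner = [
        (shuffle_split(src_train, test_ratio, rng),
         shuffle_split(tgt_train, test_ratio, rng))
        for _ in range(n_inner)
    ]

    mean_scores = {}
    for param in grid:
        scores = []
        for (s_fit, s_val), (t_fit, t_val) in inner:
            model = fit(param, s_fit, t_fit)
            scores.append(unsup_score(model, s_val, t_val))
        mean_scores[param] = sum(scores) / len(scores)

    # Refit on the full outer training part with the selected value.
    best = max(mean_scores, key=mean_scores.get)
    final_model = fit(best, src_train, tgt_train)
    return best, final_model, (src_test, tgt_test)


# Toy usage with placeholder callables (purely illustrative).
def toy_fit(param, src_idx, tgt_idx):
    return {"param": param, "n_src": len(src_idx), "n_tgt": len(tgt_idx)}


def toy_score(model, src_idx, tgt_idx):
    return -abs(model["param"] - 0.5)


best_param, model, (src_test, tgt_test) = nested_select(
    n_source=50, n_target=40, grid=[0.0, 0.5, 1.0],
    fit=toy_fit, unsup_score=toy_score,
)
```

Drawing the inner folds once, before looping over the grid, keeps the comparison between candidates fair: score differences then come from the hyperparameter rather than from a different random split.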

Paper Structure (39 sections, 11 figures, 25 tables)

Figures (11)

  • Figure 1: Illustration of different types of distribution shift between source and target domains: covariate shift, target shift, conditional shift, and subspace shift (mathematical details are available in Section \ref{sec:da_problem}). Points represent data samples, with colors indicating different classes. These synthetic datasets are used to evaluate model performance under controlled shift scenarios in the experimental section.
  • Figure 2: Visualization of nested cross-validation strategy. Both source and target data are split into an outer loop and then a nested loop. The nested loop tunes hyperparameters for the domain adaptation method, while the outer loop trains a final classifier with the best hyperparameters and evaluates its accuracy on both source and target data. Note: Target sets have no labels during the nested loop, reflecting unsupervised Domain Adaptation.
  • Figure 3: Cross-validation score as a function of the accuracy for different supervised and unsupervised scorers. The Pearson correlation coefficient for each scorer is denoted by $\rho$. Each point represents an inner split with a DA method (color of the points) and a dataset. A good score should correlate with the target accuracy.
  • Figure 4: Critical difference diagram of average ranks for scorers, computed across shallow methods and shifts (lower ranks indicate better performance). Black lines between scorers indicate pairs that are not statistically different based on the Wilcoxon test.
  • Figure 5: Spider plots showing, for all methods, the accuracy on each dataset (left) and the scorer rankings (right). For methods with no accuracy results (NA in Table \ref{tab:results}), we replace the value by 0. We provide both spider plots in the same figure so that the scorer rankings can be compared while checking the performance of each method.
  • ...and 6 more figures
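
The caption of Figure 3 judges a scorer by how well its cross-validation score correlates with target accuracy, measured by the Pearson coefficient $\rho$. A minimal sketch of that computation is given below; the helper name `pearson` and the numeric values are illustrative assumptions and are not taken from the paper's results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: one scorer value and one target accuracy per inner split.
scorer_values = [0.42, 0.55, 0.61, 0.48, 0.70]
target_accuracy = [0.58, 0.66, 0.71, 0.60, 0.79]

rho = pearson(scorer_values, target_accuracy)
```

A value of `rho` close to 1 would indicate that ranking hyperparameters by the scorer agrees with ranking them by target accuracy, which is the property the caption asks of a good score.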