Stylized Meta-Album: Group-bias injection with style transfer to study robustness against distribution shifts
Romain Mussard, Aurélien Gauffre, Ihsan Ullah, Thanh Gia Hieu Khuong, Massih-Reza Amini, Isabelle Guyon, Lisheng Sun-Hosoya
TL;DR
SMA is a large, configurable meta-dataset that pairs 12 content datasets with 20 stylistic variants to create 4,800 stylized groups. It enables controlled studies of distribution shifts, fairness, and domain adaptation by explicitly manipulating content and style factors. The authors demonstrate this with two benchmarks, one for group fairness and one for unsupervised domain adaptation, showing that increasing group diversity changes algorithm rankings and reduces variability in results. The work also proposes Top-M worst-group accuracy as a more stable model-selection metric under high group diversity and highlights SMA's potential to drive more robust, fair, and generalizable vision methods. The authors acknowledge limitations such as style-transfer artifacts and computational cost, and position SMA as a benchmarking resource for probing robustness rather than as a production dataset.
Abstract
We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets and 12 stylized datasets), designed to advance studies on out-of-distribution (OOD) generalization and related topics. Created by applying style transfer to 12 subject classification datasets, SMA provides a diverse and extensive set of 4,800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. Ideally, data collection would capture extensive group diversity, but practical constraints often make this infeasible. SMA addresses this by enabling large and configurable group structures through flexible control over styles, subject classes, and domains, allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity but also opens new methodological directions for evaluating model performance across diverse group and domain configurations (including scenarios with many minority groups, varying group imbalance, and complex domain shifts) and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark that leverages SMA's domain, class, and group diversity to re-evaluate the conclusions of existing benchmarks. Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive, as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms.
We also propose using Top-M worst-group accuracy as a new hyperparameter tuning metric, which promotes broader fairness during optimization and delivers better final worst-group accuracy when group diversity is large. (2) An unsupervised domain adaptation (UDA) benchmark that uses SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with smaller error bars (reduced by 73% and 28% in the closed-set and universal domain adaptation (UniDA) settings, respectively) than existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
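The abstract names the Top-M worst-group accuracy metric without defining it. The sketch below illustrates one plausible reading, under the assumption that it is the mean accuracy over the M groups with the lowest per-group accuracy, so that M = 1 recovers the usual worst-group accuracy. The function names and the representation of a group as a (class, style) pair are hypothetical and are not taken from the SMA code.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Return a dict mapping each group identifier to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def top_m_worst_group_accuracy(y_true, y_pred, groups, m=1):
    """Mean accuracy over the m lowest-accuracy groups.

    ASSUMPTION: this averaging definition is an illustrative reading of
    "Top-M worst-group accuracy", not the authors' confirmed formula.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    accs = sorted(group_accuracies(y_true, y_pred, groups).values())
    worst = accs[:m]  # if m exceeds the number of groups, all groups are used
    return sum(worst) / len(worst)


if __name__ == "__main__":
    # Hypothetical groups, each a (class, style) pair.
    groups = [("cat", "sketch"), ("cat", "sketch"),
              ("dog", "mosaic"), ("dog", "mosaic"),
              ("cat", "mosaic"), ("cat", "mosaic")]
    y_true = [0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0]
    for m in (1, 2, 3):
        print(m, top_m_worst_group_accuracy(y_true, y_pred, groups, m=m))
```

Under this reading, averaging over several of the worst groups makes the selection signal less dependent on a single, possibly very small, group, which would be consistent with the abstract's claim that the metric helps when group diversity is large.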
