Re-evaluating Group Robustness via Adaptive Class-Specific Scaling

Seonguk Seo, Bohyung Han

TL;DR

This work tackles the persistent trade-off between robust (group-wise) and average accuracies in group robustness methods. It introduces training-free class-specific scaling as a post-processing step to control this trade-off, and extends it with instance-wise scaling that leverages feature clusters for per-example adjustments. A novel robust coverage metric is proposed to quantify the trade-off along the Pareto frontier, enabling a unified evaluation across methods. Empirical results across computer vision and NLP benchmarks show that simple robust scaling (RS) and its instance-wise variant (IRS) can match or outperform several debiasing approaches with negligible training overhead, highlighting the potential of post-processing for group robustness. The framework provides practical guidance for selecting desirable performance points and offers insight into the behavior of existing debiasing techniques beyond robust accuracy alone.

Abstract

Group distributionally robust optimization, which aims to improve robust accuracies -- worst-group and unbiased accuracies -- is a prominent algorithm used to mitigate spurious correlations and address dataset bias. Although existing approaches have reported improvements in robust accuracies, these gains often come at the cost of average accuracy due to inherent trade-offs. To control this trade-off flexibly and efficiently, we propose a simple class-specific scaling strategy, directly applicable to existing debiasing algorithms with no additional training. We further develop an instance-wise adaptive scaling technique to alleviate this trade-off, even leading to improvements in both robust and average accuracies. Our approach reveals that a naïve ERM baseline matches or even outperforms the recent debiasing methods by simply adopting the class-specific scaling technique. Additionally, we introduce a novel unified metric that quantifies the trade-off between the two accuracies as a scalar value, allowing for a comprehensive evaluation of existing algorithms. By tackling the inherent trade-off and offering a performance landscape, our approach provides valuable insights into robust techniques beyond just robust accuracy. We validate the effectiveness of our framework through experiments across datasets in computer vision and natural language processing domains.
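To make the idea of class-specific scaling concrete, the following is a minimal sketch, not the authors' implementation. It assumes the scaling multiplies each class's predicted probability by a per-class factor before taking the argmax, and it sweeps one factor over a small grid while recording average and worst-group accuracy. The function names, the toy probabilities, and the group labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def scaled_predictions(probs: Sequence[Sequence[float]],
                       scales: Sequence[float]) -> List[int]:
    """Predict argmax_c (scales[c] * probs[i][c]) for every example i."""
    preds = []
    for p in probs:
        scores = [s * v for s, v in zip(scales, p)]
        preds.append(max(range(len(scores)), key=scores.__getitem__))
    return preds


def average_and_worst_group_accuracy(preds: Sequence[int],
                                     labels: Sequence[int],
                                     groups: Sequence[Hashable]
                                     ) -> Tuple[float, float]:
    """Return (average accuracy, accuracy of the worst-performing group)."""
    hits: Dict[Hashable, int] = {}
    totals: Dict[Hashable, int] = {}
    for p, y, g in zip(preds, labels, groups):
        hits[g] = hits.get(g, 0) + int(p == y)
        totals[g] = totals.get(g, 0) + 1
    average = sum(hits.values()) / sum(totals.values())
    worst = min(hits[g] / totals[g] for g in totals)
    return average, worst


def sweep_second_class_scale(probs, labels, groups, grid):
    """Evaluate each candidate scale for class 1, with class 0 fixed at 1.0.

    Only the ratio of the two factors affects a binary argmax, so fixing
    one of them loses no generality in the two-class case.
    """
    results = []
    for s in grid:
        preds = scaled_predictions(probs, [1.0, s])
        avg, worst = average_and_worst_group_accuracy(preds, labels, groups)
        results.append((s, avg, worst))
    return results


# Hypothetical held-out predictions from an already trained binary classifier.
PROBS = [
    [0.90, 0.10],  # label 0, group "0a"
    [0.80, 0.20],  # label 0, group "0a"
    [0.70, 0.30],  # label 0, group "0b"
    [0.20, 0.80],  # label 1, group "1a"
    [0.60, 0.40],  # label 1, group "1b" (wrong under uniform scaling)
    [0.55, 0.45],  # label 1, group "1b" (wrong under uniform scaling)
]
LABELS = [0, 0, 0, 1, 1, 1]
GROUPS = ["0a", "0a", "0b", "1a", "1b", "1b"]

if __name__ == "__main__":
    for s, avg, worst in sweep_second_class_scale(
            PROBS, LABELS, GROUPS, [1.0, 2.0, 3.0]):
        print(f"scale={s:.1f}  average={avg:.3f}  worst-group={worst:.3f}")
```

On this toy data, a moderate factor for class 1 repairs the errors in group "1b", while too large a factor starts to misclassify group "0b". This is the kind of curve that a practitioner would trace on a validation split before picking an operating point. No retraining is involved, since the sketch only rescales stored predictions.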

Paper Structure

This paper contains 47 sections, 11 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The scatter plots illustrate trade-offs between robust and average accuracies of existing algorithms with ResNet-18 on CelebA. We visualize the results from multiple runs of each algorithm and present the relationship between the two accuracies. The lines denote the linear regression results of individual algorithms and $r$ in the legend indicates the Pearson correlation coefficient.
  • Figure 2: Comparison between the baseline ERM and existing debiasing approaches with ResNet-50 on CelebA. Existing works have improved robust accuracy substantially compared to ERM, but our robust scaling strategies such as RS and IRS enable ERM to catch up with or even outperform them without further training.
  • Figure 3: The relation between the robust and average accuracies obtained by varying the class-specific scaling factor $\mathbf{s}$ with ERM on CelebA. The black marker denotes the original point, where the uniform scaling is applied.
  • Figure 4: The robust-average accuracy trade-off curves of various baselines on the CelebA dataset. The black marker denotes the original point, where the uniform scaling is applied.
  • Figure 5: Sensitivity analysis with respect to the number of clusters in IRS on Waterbirds. The trend of the robust coverage in the validation split (orange) is similar to that of the robust accuracy in the test split (blue).
  • ...and 2 more figures