Optimizing importance weighting in the presence of sub-population shifts

Floris Holstege, Bram Wouters, Noud van Giersbergen, Cees Diks

TL;DR

The paper addresses distribution shifts, in particular sub-population shifts, and challenges the common practice of using likelihood-ratio weights in importance weighting. It develops a bi-level optimization framework that jointly optimizes group-based weights and model parameters, and shows that a bias-variance trade-off governs finite-sample performance. The authors analytically characterize the optimal weights in a linear-regression toy model and demonstrate empirically that optimized weights improve generalization for last-layer retraining across vision and NLP benchmarks, often making the result more robust to hyperparameter choices. They provide an open-source implementation and discuss limitations and possible extensions, including optimizing weights while training the full model and handling unknown test distributions. The work turns weight selection into a data-driven, gradient-based optimization problem that complements existing importance weighting methods.

Abstract

A distribution shift between the training and test data can severely harm performance of machine learning models. Importance weighting addresses this issue by assigning different weights to data points during training. We argue that existing heuristics for determining the weights are suboptimal, as they neglect the increase of the variance of the estimated model due to the finite sample size of the training data. We interpret the optimal weights in terms of a bias-variance trade-off, and propose a bi-level optimization procedure in which the weights and model parameters are optimized simultaneously. We apply this optimization to existing importance weighting techniques for last-layer retraining of deep neural networks in the presence of sub-population shifts and show empirically that optimizing weights significantly improves generalization performance.
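The bi-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the inner problem here is a closed-form weighted least-squares fit, and the outer problem is a plain grid search over a single group weight `p` in place of the gradient-based procedure the paper proposes. The data generator `make_data`, the function names, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_data(n, d, p_group1, rng):
    """Hypothetical toy data (an assumption, not the paper's model):
    the first feature has slope 1 in group 1 and slope 0 in group 0."""
    g = rng.binomial(1, p_group1, size=n)
    X = rng.normal(size=(n, d))
    y = np.where(g == 1, 1.0, 0.0) * X[:, 0] + rng.normal(size=n)
    return X, y, g

def group_weights(g, p):
    """Per-sample weights: group 1 carries total mass p, group 0 mass 1 - p."""
    n1 = max(int(g.sum()), 1)
    n0 = max(int((1 - g).sum()), 1)
    return np.where(g == 1, p / n1, (1 - p) / n0)

def weighted_least_squares(X, y, w, ridge=1e-6):
    """Inner problem: minimise sum_i w_i (y_i - x_i^T theta)^2 (small ridge for stability)."""
    Xw = X * w[:, None]
    A = X.T @ Xw + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, Xw.T @ y)

def optimise_group_weight(X_tr, y_tr, g_tr, X_val, y_val, g_val, p_te, grid):
    """Outer problem: choose p whose inner solution has the lowest
    validation loss, with validation samples reweighted to match p_te."""
    w_val = group_weights(g_val, p_te)
    losses = []
    for p in grid:
        theta = weighted_least_squares(X_tr, y_tr, group_weights(g_tr, p))
        losses.append(np.sum(w_val * (y_val - X_val @ theta) ** 2))
    losses = np.array(losses)
    return grid[int(np.argmin(losses))], losses

# Usage: training data is dominated by group 1, the target distribution is balanced.
rng = np.random.default_rng(0)
X_tr, y_tr, g_tr = make_data(n=200, d=10, p_group1=0.9, rng=rng)
X_val, y_val, g_val = make_data(n=200, d=10, p_group1=0.9, rng=rng)
grid = np.linspace(0.05, 0.95, 19)
p_star, losses = optimise_group_weight(
    X_tr, y_tr, g_tr, X_val, y_val, g_val, p_te=0.5, grid=grid
)
print("selected p:", p_star)
```

The heuristic choice would set `p` equal to `p_te`; the search instead lets the validation loss decide, so the selected value may differ when the training set is small. A grid search only works for a handful of weights, which is why a gradient-based outer step is preferable once there are many groups.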

Paper Structure

This paper contains 25 sections, 54 equations, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: The optimal choice $p^*_n$, given by Equation \ref{eq:optimal_weight_regression}, as a function of the training dataset size $n$, for varying feature dimension $d$ (panel (a)) and training distribution $p_{\mathrm{tr}}$ (panel (b)). Other parameters are fixed at $a_1 = 1$, $a_0 = 0$, $\sigma^2 = 1$, $p_{\mathrm{te}} = 0.5$. Note that the heuristic choice for $p$ would be 0.5. Also note that Equation \ref{eq:optimal_weight_regression} is derived approximately for large $n$ (hence the dashed lines for small $n$).
  • Figure 2: Estimated generalization performance (averaged over 5 runs) of the standard choice of weights (squares) versus optimized weights (triangles). Error bars reflect the paired-sample 90% confidence interval of the difference.
  • Figure 3: Estimated generalization performance (averaged over 5 runs) of the standard choice of weights (squares) versus optimized weights (triangles) for GW-ERM as a function of the sizes of the training and validation sets. Error bars reflect the paired-sample 90% confidence interval of the difference.
  • Figure 4: Estimated generalization performance (averaged over 5 runs) of the standard choice of weights (squares) versus optimized weights (triangles) for GW-ERM as a function of the L1 regularization parameter. Error bars reflect the paired-sample 90% confidence interval of the difference.
  • Figure 5: Illustration of the bias-variance trade-off for $p_{\mathrm{tr}} = 0.9$, $p_{\mathrm{te}} = 0.5$, $a_1 = 1$, $a_0 = 0$, $\sigma^2 = 1$, $\gamma = 1$, and different values of $n$ and $d$. The simulated MSE is averaged over 1,000 runs. Error bars reflect the 95% confidence interval.
  • ...and 2 more figures