Learning Pareto manifolds in high dimensions: How can regularization help?

Tobias Wegel, Filip Kovačević, Alexandru Ţifrea, Fanny Yang

TL;DR

This work tackles learning Pareto fronts in high-dimensional multi-objective learning (MOL) with limited labeled data. It shows that naively applying direct regularization to scalarized objectives is insufficient, and introduces a two-stage estimator that first learns distributional parameters (potentially using unlabeled data) and then optimizes a scalarized MOL objective to recover the Pareto set. The authors establish upper bounds that propagate parameter estimation errors to Pareto-point errors and prove minimax lower bounds, showing that the two-stage method is minimax-optimal under Lipschitz identifiability. The approach is illustrated on examples such as multi-distribution sparse regression and fairness-risk trade-offs, and validated by experiments with ensembles and hypernetworks that approximate the Pareto set. Overall, the paper offers a theory-backed framework for Pareto learning in high dimensions, with implications for robust, fair, and multi-objective ML systems.

Abstract

Simultaneously addressing multiple objectives is becoming increasingly important in modern machine learning. At the same time, data is often high-dimensional and costly to label. For a single objective such as prediction risk, conventional regularization techniques are known to improve generalization when the data exhibits low-dimensional structure like sparsity. However, it is largely unexplored how to leverage this structure in the context of multi-objective learning (MOL) with multiple competing objectives. In this work, we discuss how the application of vanilla regularization approaches can fail, and propose a two-stage MOL framework that can successfully leverage low-dimensional structure. We demonstrate its effectiveness experimentally for multi-distribution learning and fairness-risk trade-offs.

Paper Structure

This paper contains 37 sections, 21 theorems, 165 equations, 4 figures, 2 algorithms.

Key Result

Proposition 1

Let $G_k:=\sup_{\boldsymbol{\lambda}\in\Delta^K}\left\|\nabla_\vartheta\mathcal{L}_k(\vartheta_{\boldsymbol{\lambda}})\right\|_2$, assume $\vartheta\mapsto \mathcal{L}_k(\vartheta)$ is $\nu_k$-smooth, and define $\varepsilon_{\max} := \max_{k\in[K],\boldsymbol{\lambda}\in\Delta^K}\varepsilon(G_k,\nu for $\widehat{\mathfrak{F}} = \{\boldsymbol{\mathcal{L}}(\widehat{\vartheta}_{\boldsymbol{\lambda}}

Figures (4)

  • Figure 1: The parameter space $\mathbb{R}^m$ (left) parameterizes the hypothesis set $\mathcal{F}$ and contains the population Pareto set $\{\vartheta_{\boldsymbol{\lambda}} | \boldsymbol{\lambda}\in\Delta^{K}\}$ (gray line), and the set of the empirical estimators $\{\widehat{\vartheta}_{\boldsymbol{\lambda}} | \boldsymbol{\lambda}\in \Delta^K\}$ (dashed blue line). The right figure depicts the region of all values that can be obtained by $\boldsymbol{\mathcal{L}}(\vartheta)$ for some $\vartheta$ (gray shaded area), the population Pareto front $\mathfrak{F}$ (gray line) and estimated Pareto front (dashed blue line).
  • Figure 2: Illustration of the intuition for the propositions on the insufficiency of direct (plug-in) regularization and the necessity of unlabeled data, in linear regression with squared loss and linear scalarization: For any $v\in B_2^d$, we can find covariance matrices $\Sigma_1,\Sigma_2$ with constrained condition number, and $1$-sparse $\beta_1,\beta_2$, so that the minimizer $\vartheta_{\boldsymbol{\lambda}}$ of the scalarized objective satisfies $v = \vartheta_{\boldsymbol{\lambda}}$. This makes learning with direct regularization and without enough unlabeled data infeasible.
  • Figure 3: The important roles of both regularization and additional unlabeled data for the multiple linear regression example, illustrated on an intuitive level (covariance-intuition panel) and by evaluating the excess scalarized loss in simulations (covariance-simulation panel): Increasing sparsity together with appropriate regularization improves the estimate of the parameters $\beta_k$, while an increasing number of unlabeled datapoints $N_k$ improves the estimate of the covariance matrices $\Sigma_k$, both improving the estimation of the Pareto front. Multiple-regression panel: Pareto fronts for two sparse linear regression problems, using direct regularization and the two-stage approach (multi-distribution regression subsection). We also plot the hypernetwork implementation.
  • Figure 4: Pareto fronts on test data and their estimates using direct regularization (orange) and our method (blue) for the fairness experiments described in the fairness-risk experiment and fairness datasets subsections, using data from Redmond2002Communities, Becker1996Adult, Jeong2022fairness, and Alghamdi2022beyond. Each experiment is repeated 20 times and we plot the results (transparent), as well as their average (thick lines).
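
The roles of regularization and unlabeled data described in the Figure 2 and Figure 3 captions can be made concrete with a minimal sketch, assuming zero-mean covariates, squared loss, and linear scalarization. Under those assumptions the population scalarized risk equals $\sum_k \lambda_k (\vartheta-\beta_k)^\top\Sigma_k(\vartheta-\beta_k)$ up to a constant, so its minimizer is $\vartheta_{\boldsymbol{\lambda}} = (\sum_k \lambda_k\Sigma_k)^{-1}\sum_k \lambda_k\Sigma_k\beta_k$. The code below estimates each $\beta_k$ with a Lasso and each $\Sigma_k$ with the sample covariance of unlabeled covariates, then plugs both into this formula. The function names, the plain ISTA solver, the dimensions, and the regularization strength are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def lasso_ista(X, y, reg, n_iter=500):
    """Lasso via proximal gradient (ISTA): min_b (1/2n)||y - Xb||^2 + reg * ||b||_1."""
    n, d = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the smooth part
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * reg, 0.0)  # soft-thresholding
    return b


def two_stage_pareto_point(lam, betas_hat, sigmas_hat):
    """Plug-in minimizer of sum_k lam_k (t - b_k)^T S_k (t - b_k) over t."""
    A = sum(l * S for l, S in zip(lam, sigmas_hat))
    rhs = sum(l * S @ b for l, S, b in zip(lam, sigmas_hat, betas_hat))
    return np.linalg.solve(A, rhs)


# Hypothetical simulation: two distributions, 1-sparse ground truths.
rng = np.random.default_rng(0)
d, n, N = 50, 30, 2000  # dimension, labeled samples, unlabeled samples (assumed values)
sigmas = [np.eye(d), np.diag(np.linspace(0.5, 2.0, d))]
betas = [np.eye(d)[0], np.eye(d)[1]]

betas_hat, sigmas_hat = [], []
for S, b in zip(sigmas, betas):
    L = np.linalg.cholesky(S)
    X = rng.standard_normal((n, d)) @ L.T          # labeled covariates
    y = X @ b + 0.1 * rng.standard_normal(n)       # noisy labels
    U = rng.standard_normal((N, d)) @ L.T          # unlabeled covariates
    betas_hat.append(lasso_ista(X, y, reg=0.05))   # stage 1a: sparse regression
    sigmas_hat.append(U.T @ U / N)                 # stage 1b: covariance from unlabeled data

# Stage 2: sweep the scalarization weights to trace the estimated Pareto set.
lams = np.linspace(0.0, 1.0, 11)
pareto_hat = np.array(
    [two_stage_pareto_point((l, 1.0 - l), betas_hat, sigmas_hat) for l in lams]
)
pareto_true = np.array(
    [two_stage_pareto_point((l, 1.0 - l), betas, sigmas) for l in lams]
)
err = np.linalg.norm(pareto_hat - pareto_true, axis=1).max()
print("max parameter error along the estimated Pareto set:", err)
```

Sweeping `lams` traces an estimated Pareto set in parameter space. The comparison against `pareto_true` is possible only because the simulation knows the true $\beta_k$ and $\Sigma_k$.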

Theorems & Definitions (41)

  • Example 1: Sparse linear regression
  • Example 2: Fairness and risk
  • Definition 1: Pareto-optimality
  • Definition 2
  • Proposition 1
  • Example 3: Sparse fixed-design linear regression
  • Proposition 2: Insufficiency of direct regularization
  • Definition 3
  • Proposition 3
  • Theorem 1
  • ...and 31 more