Jacobian Descent for Multi-Objective Optimization

Pierre Quinton, Valérian Rey

TL;DR

Jacobian descent (JD) addresses multi-objective optimization in deep learning by updating parameters through an aggregator applied to the Jacobian of a vector-valued loss, avoiding naive scalarization. The proposed UPGrad aggregator is non-conflicting, linear under scaling, and weighted, enabling convergence guarantees to the Pareto front in smooth convex settings, and supporting stochastic variants for large objective counts. The paper introduces instance-wise risk minimization (IWRM) as a novel learning paradigm, demonstrates empirical gains on simple image tasks, and provides a Gramian-based, memory-efficient implementation pathway for JD. These advances offer a scalable, principled framework for balancing conflicting objectives in tasks such as multi-task learning, adversarial training, and distributed optimization, with practical impact on how complex losses are optimized in neural networks.

Abstract

Many optimization problems require balancing multiple conflicting objectives. As gradient descent is limited to single-objective optimization, we introduce its direct generalization: Jacobian descent (JD). This algorithm iteratively updates parameters using the Jacobian matrix of a vector-valued objective function, in which each row is the gradient of an individual objective. While several methods to combine gradients already exist in the literature, they are generally hindered when the objectives conflict. In contrast, we propose projecting gradients to fully resolve conflict while ensuring that they preserve an influence proportional to their norm. We prove significantly stronger convergence guarantees with this approach, supported by our empirical results. Our method also enables instance-wise risk minimization (IWRM), a novel learning paradigm in which the loss of each training example is considered a separate objective. Applied to simple image classification tasks, IWRM exhibits promising results compared to the direct minimization of the average loss. Additionally, we outline an efficient implementation of JD using the Gramian of the Jacobian matrix to reduce time and memory requirements.
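
To make the setup concrete, the following is a minimal sketch of one Jacobian descent step under IWRM, written with PyTorch autograd: each training example in a batch contributes its own loss, the Jacobian stacks one flattened gradient per example, and an aggregator maps that matrix to a single update direction. The toy model, data, learning rate, and the `aggregate` helper are illustrative assumptions rather than the paper's implementation; the plain mean used here reduces to ordinary averaged-gradient descent, and a non-conflicting aggregator such as UPGrad (or the Gramian-based procedure outlined in the paper) would take its place.

```python
# Illustrative sketch of one Jacobian descent (JD) step for instance-wise risk
# minimization (IWRM). Hypothetical example: the model, data, learning rate and
# the `aggregate` helper are assumptions, not the paper's implementation.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 3)                     # toy model (assumption)
x = torch.randn(32, 10)                            # one mini-batch of 32 examples
y = torch.randint(0, 3, (32,))
per_example_loss = torch.nn.CrossEntropyLoss(reduction="none")
params = list(model.parameters())

def aggregate(J: torch.Tensor) -> torch.Tensor:
    """Map the (m x n) Jacobian to one update direction. Here: the plain mean of
    the rows, i.e. ordinary gradient averaging; a non-conflicting aggregator
    such as UPGrad would replace this in JD proper."""
    return J.mean(dim=0)

# 1) One loss per training example: the vector-valued IWRM objective.
losses = per_example_loss(model(x), y)             # shape (32,)

# 2) Jacobian: row i is the flattened gradient of the i-th example's loss.
rows = []
for loss in losses:
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))
J = torch.stack(rows)                              # shape (32, n_params)

# 3) Aggregate the Jacobian and take a descent step.
direction, lr = aggregate(J), 0.1
with torch.no_grad():
    offset = 0
    for p in params:
        n = p.numel()
        p -= lr * direction[offset:offset + n].view_as(p)
        offset += n
```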

Paper Structure

This paper contains 82 sections, 9 theorems, 60 equations, 12 figures, 7 tables, and 3 algorithms.

Key Result

Proposition 1

Let $J\in\mathbb R^{m \times n}$. For any ${\boldsymbol{u}}\in{\mathbb R}^m$, $\pi_J(J^\top {\boldsymbol{u}})=J^\top {\boldsymbol{w}}$ with
$${\boldsymbol{w}} \in \operatorname*{arg\,min}_{{\boldsymbol{v}}\in\mathbb R^m,\; {\boldsymbol{u}}\leq{\boldsymbol{v}}} {\boldsymbol{v}}^\top J J^\top {\boldsymbol{v}},$$
where the inequality ${\boldsymbol{u}}\leq{\boldsymbol{v}}$ holds element-wise and $\pi_J$ denotes the Euclidean projection onto the dual cone of the rows of $J$.
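
As a sanity check of this characterization, here is a small numerical sketch (illustrative only; the random problem instance and the solver choices are assumptions, not the paper's code). It computes the Euclidean projection of $J^\top {\boldsymbol{u}}$ onto the dual cone $\{{\boldsymbol{x}} : J{\boldsymbol{x}} \geq 0\}$ directly and compares it with $J^\top {\boldsymbol{w}}$, where ${\boldsymbol{w}}$ solves the constrained quadratic program above; the two agree up to solver tolerance, and the resulting direction has a non-negative inner product with every row of $J$.

```python
# Numerical check of Proposition 1 (illustrative sketch; the problem instance
# and solver choices are assumptions).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n = 4, 6
J = rng.standard_normal((m, n))                    # rows are the gradients
u = rng.standard_normal(m)
z = J.T @ u

# Left-hand side: direct projection of z onto the dual cone {x : J x >= 0}.
direct = minimize(
    lambda y: np.sum((y - z) ** 2),
    x0=np.zeros(n),                                # feasible start (J 0 >= 0)
    constraints=[{"type": "ineq", "fun": lambda y: J @ y}],
    method="SLSQP",
    options={"ftol": 1e-12, "maxiter": 1000},
).x

# Right-hand side: J^T w with w minimizing v^T J J^T v subject to u <= v.
G = J @ J.T                                        # Gramian of the Jacobian
w = minimize(
    lambda v: v @ G @ v,
    x0=np.maximum(u, 0.0),                         # feasible start (>= u)
    jac=lambda v: 2.0 * G @ v,
    bounds=[(ui, None) for ui in u],
    method="L-BFGS-B",
).x
d = J.T @ w

print("max |pi_J(J^T u) - J^T w| =", np.max(np.abs(direct - d)))  # ~0
print("min_i  g_i . d            =", (J @ d).min())               # >= 0: non-conflicting
```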

Figures (12)

  • Figure 1: Aggregation of $J=[{\boldsymbol{g}}_1 \; {\boldsymbol{g}}_2]^\top\in\mathbb R^{2\times 2}$ by four different aggregators. The dual cone of $\{{\boldsymbol{g}}_1, {\boldsymbol{g}}_2\}$ is represented in green. (a) $\mathcal{A}_{\text{UPGrad}}$ projects ${\boldsymbol{g}}_1$ and ${\boldsymbol{g}}_2$ onto the dual cone and averages the results. (b) The mean $\mathcal{A}_{\text{Mean}}(J) = \frac{1}{2}({\boldsymbol{g}}_1 + {\boldsymbol{g}}_2)$ conflicts with ${\boldsymbol{g}}_1$. $\mathcal{A}_{\text{DualProj}}$ projects this mean onto the dual cone, so it lies on its boundary. $\mathcal{A}_{\text{MGDA}}(J)$ is almost orthogonal to ${\boldsymbol{g}}_2$ because of its larger norm. (A small numerical sketch of these aggregators follows this figure list.)
  • Figure 2: Optimization metrics obtained with IWRM using 1024 training examples and a batch size of 32, averaged over 8 independent runs. The shaded area around each curve shows the estimated standard error of the mean over the 8 runs. Curves are smoothed for readability. Best viewed in color.
  • Figure 3: Optimization trajectories of various aggregators when optimizing ${\boldsymbol{f}}_{\mathrm{EWQ}}: [x_1 \; x_2]^\top \mapsto [x_1^2 \; x_2^2]^\top$ with JD. Colored dots represent the initial points. The trajectories start in red and evolve towards yellow.
  • Figure 4: Optimization trajectories of various aggregators when optimizing the convex quadratic form ${\boldsymbol{f}}_{\mathrm{CQF}}$ with JD. Colored dots represent initial parameter values. The trajectories start in red and evolve towards yellow.
  • Figure 5: SVHN results.
  • ...and 7 more figures
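
To make the behaviors described in the Figure 1 caption concrete, here is a small numerical sketch (illustrative only; the two gradients and the `project_dual_cone` helper are assumptions, and the aggregators are simplified reimplementations in the style of Mean, DualProj, and UPGrad rather than the paper's code). With a conflicting pair of 2-D gradients, the mean has a negative inner product with ${\boldsymbol{g}}_1$, whereas both projection-based directions conflict with neither gradient.

```python
# Comparing Mean, DualProj-style and UPGrad-style aggregation on two conflicting
# 2-D gradients (illustrative sketch; the gradients and helper are assumptions).
import numpy as np
from scipy.optimize import minimize

def project_dual_cone(J, u):
    """pi_J(J^T u) via the Proposition 1 form: J^T w, w = argmin_{v >= u} v^T J J^T v."""
    G = J @ J.T
    res = minimize(lambda v: v @ G @ v, x0=np.maximum(u, 0.0),
                   jac=lambda v: 2.0 * G @ v,
                   bounds=[(ui, None) for ui in u], method="L-BFGS-B")
    return J.T @ res.x

g1 = np.array([1.0, 0.0])                          # g1 . g2 < 0: the gradients conflict
g2 = np.array([-2.0, 1.0])
J = np.stack([g1, g2])

mean = J.T @ np.array([0.5, 0.5])                             # Mean of the gradients
dualproj = project_dual_cone(J, np.array([0.5, 0.5]))         # DualProj-style: project the mean
upgrad = 0.5 * (project_dual_cone(J, np.array([1.0, 0.0]))    # UPGrad-style: project each
                + project_dual_cone(J, np.array([0.0, 1.0]))) # gradient, then average

for name, d in [("Mean", mean), ("DualProj", dualproj), ("UPGrad", upgrad)]:
    print(f"{name:8s} d = {np.round(d, 3)}   d.g1 = {d @ g1:+.3f}   d.g2 = {d @ g2:+.3f}")
```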

Theorems & Definitions (23)

  • Definition 1: Non-conflicting
  • Definition 2: Linear under scaling
  • Definition 3: Weighted
  • Proposition 1
  • Proof
  • Theorem 1
  • Proof
  • Lemma 1
  • Proof
  • Lemma 2
  • ...and 13 more