Jacobian Descent for Multi-Objective Optimization
Pierre Quinton, Valérian Rey
TL;DR
Jacobian descent (JD) addresses multi-objective optimization in deep learning by updating parameters through an aggregator applied to the Jacobian of a vector-valued loss, rather than through naive scalarization. The proposed UPGrad aggregator is non-conflicting, linear under scaling, and weighted; these properties yield guaranteed convergence to the Pareto front in smooth convex settings and support stochastic variants when the number of objectives is large. The paper introduces instance-wise risk minimization (IWRM), a novel learning paradigm in which the loss of each training example is treated as a separate objective, demonstrates empirical gains over direct minimization of the average loss on simple image classification tasks, and outlines a Gramian-based, memory-efficient implementation of JD. Together, these contributions offer a scalable, principled framework for balancing conflicting objectives in settings such as multi-task learning, adversarial training, and distributed optimization.
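To make the update rule concrete, here is a minimal PyTorch sketch of one JD step, assuming a list of per-objective scalar losses. The names `jd_step` and `aggregate` are illustrative, not the paper's API: the aggregator is any map from the Jacobian to a single update direction.

```python
import torch

def jd_step(params, losses, aggregate, lr=0.01):
    # Build the Jacobian: one row per objective, one column per parameter.
    rows = []
    for loss in losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    jacobian = torch.stack(rows)     # shape: (num_objectives, num_params)
    # Reduce the Jacobian to a single update direction.
    direction = aggregate(jacobian)  # shape: (num_params,)
    # Gradient-descent-style update along the aggregated direction.
    with torch.no_grad():
        offset = 0
        for p in params:
            n = p.numel()
            p -= lr * direction[offset:offset + n].view_as(p)
            offset += n
```

With `aggregate = lambda J: J.mean(dim=0)` this collapses to gradient descent on the average loss; JD becomes genuinely different only when the aggregator actively resolves conflicts between the rows.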
Abstract
Many optimization problems require balancing multiple conflicting objectives. As gradient descent is limited to single-objective optimization, we introduce its direct generalization: Jacobian descent (JD). This algorithm iteratively updates parameters using the Jacobian matrix of a vector-valued objective function, in which each row is the gradient of an individual objective. While several methods to combine gradients already exist in the literature, they are generally hindered when the objectives conflict. In contrast, we propose projecting gradients to fully resolve conflict while ensuring that they preserve an influence proportional to their norm. We prove significantly stronger convergence guarantees with this approach, supported by our empirical results. Our method also enables instance-wise risk minimization (IWRM), a novel learning paradigm in which the loss of each training example is considered a separate objective. Applied to simple image classification tasks, IWRM exhibits promising results compared to the direct minimization of the average loss. Additionally, we outline an efficient implementation of JD using the Gramian of the Jacobian matrix to reduce time and memory requirements.
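The projection step can be illustrated with a small NumPy/SciPy sketch of a dual-cone-projection aggregator in the spirit of UPGrad; this is our reading of the idea, not the paper's reference implementation. Each gradient (a row of the Jacobian J) is projected onto the dual cone {v : Jv >= 0}, and the projections are averaged; every direction in that cone has a nonnegative inner product with each gradient, so the resulting update is non-conflicting.

```python
import numpy as np
from scipy.optimize import nnls

def project_dual_cone(jacobian: np.ndarray, g: np.ndarray) -> np.ndarray:
    # Projection of g onto {v : jacobian @ v >= 0}. By the KKT conditions,
    # the projection is v = g + J^T w, where w solves the nonnegative
    # least-squares problem argmin_{w >= 0} ||J^T w + g||.
    w, _ = nnls(jacobian.T, -g)
    return g + jacobian.T @ w

def upgrad_like(jacobian: np.ndarray) -> np.ndarray:
    # Project every row onto the dual cone of the rows, then average.
    # The dual cone is convex, so the mean stays inside it.
    return np.mean([project_dual_cone(jacobian, g) for g in jacobian], axis=0)

# Two conflicting gradients: their average, [1.0, 0.5], has a negative
# inner product with the second gradient, but the aggregated direction
# conflicts with neither objective.
J = np.array([[3.0, 1.0], [-1.0, 0.0]])
d = upgrad_like(J)
assert np.all(J @ d >= -1e-9)
```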

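The Gramian-based implementation mentioned at the end of the abstract can be sketched from the same derivation: the nonnegative least-squares problem above depends on the Jacobian only through the m x m Gramian G = JJ^T, since the i-th gradient is g_i = J^T e_i, and the aggregated direction is J^T times a weight vector. The function below is a hypothetical illustration of that reformulation, assuming a ridge-regularized Cholesky factorization; it is not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import nnls

def upgrad_like_weights(gramian: np.ndarray, ridge: float = 1e-12) -> np.ndarray:
    # Recover aggregation weights from G = J @ J.T alone; the JD direction
    # is then jacobian.T @ weights. Only the m x m Gramian is needed for
    # the solve, never the full m x num_params Jacobian.
    m = gramian.shape[0]
    # Factor G = L @ L.T: the dual objective (1/2) w^T G w + w^T G e_i
    # equals (1/2) ||L.T @ (w + e_i)||^2 up to a constant, again a
    # nonnegative least-squares problem. The ridge keeps the Cholesky
    # factorization valid when the Jacobian is rank-deficient.
    L = np.linalg.cholesky(gramian + ridge * np.eye(m))
    weights = np.zeros(m)
    for i in range(m):
        e_i = np.zeros(m)
        e_i[i] = 1.0
        w, _ = nnls(L.T, -(L.T @ e_i))
        # The projection of g_i is J^T (e_i + w); averaging the
        # projections averages these weight vectors.
        weights += (e_i + w) / m
    return weights
```

Because only G enters the weight computation, the memory cost of the solve scales with the number of objectives rather than the number of parameters, provided G itself can be formed without materializing the full Jacobian; the abstract states that such an efficient implementation exists without detailing it here.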