Pareto Low-Rank Adapters: Efficient Multi-Task Learning with Preferences

Nikolaos Dimitriadis, Pascal Frossard, Francois Fleuret

TL;DR

PaLoRA tackles the scalability and reliability limitations of Pareto Front Learning by introducing per-task low-rank adapters and a deterministic, annealed preference schedule. The method preserves a shared backbone for general features while adapters capture task-specific cues, enabling efficient exploration of the Pareto Front with reduced memory overhead. The authors prove a universal-approximation property and demonstrate superior performance and faster convergence across multi-label, regression, and scene-understanding benchmarks, including CityScapes and NYUv2, while enabling continuous PF expansion from pretrained checkpoints. This work offers a practical path to scalable, flexible multi-task modeling in real-world systems.

Abstract

Multi-task trade-offs in machine learning can be addressed via Pareto Front Learning (PFL) methods that parameterize the Pareto Front (PF) with a single model. PFL allows the desired operating point to be selected at inference time, in contrast to traditional Multi-Task Learning (MTL), which optimizes for a single trade-off decided prior to training. However, recent PFL methodologies suffer from limited scalability, slow convergence, and excessive memory requirements, while exhibiting inconsistent mappings from preference to objective space. We introduce PaLoRA, a novel parameter-efficient method that addresses these limitations in two ways. First, we augment any neural network architecture with task-specific low-rank adapters and continuously parameterize the PF in their convex hull. Our approach steers the original model and the adapters towards learning general and task-specific features, respectively. Second, we propose a deterministic sampling schedule of preference vectors that reinforces this division of labor, enabling faster convergence and strengthening the validity of the mapping from preference to objective space throughout training. Our experiments show that PaLoRA outperforms state-of-the-art MTL and PFL baselines across various datasets and scales to large networks, reducing the memory overhead by a factor of $23.8$ to $31.7$ compared with competing PFL baselines in scene understanding benchmarks.
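The abstract describes layers whose weights combine a shared matrix with task-specific low-rank adapters, mixed by a preference vector that also weighs the task losses. A minimal NumPy sketch of that idea follows. The function names, the `alpha / r` scaling (borrowed from common LoRA practice), and the adapter shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def preference_weight(W, adapters, lam, alpha=1.0):
    """Mix a base weight with per-task low-rank adapters (illustrative sketch).

    W:        (d_out, d_in) shared base weight matrix.
    adapters: list of (A_t, B_t) pairs, A_t: (d_out, r), B_t: (r, d_in).
    lam:      preference vector on the probability simplex, one entry per task.
    alpha:    scaling of the low-rank update; alpha / r is an assumption
              taken from common LoRA practice, not from the paper.
    """
    lam = np.asarray(lam, dtype=float)
    if lam.shape[0] != len(adapters):
        raise ValueError("need exactly one adapter pair per preference entry")
    if np.any(lam < 0) or not np.isclose(lam.sum(), 1.0):
        raise ValueError("preference must lie on the probability simplex")
    r = adapters[0][0].shape[1]
    # Convex combination of the low-rank updates A_t @ B_t.
    delta = sum(l * (A @ B) for l, (A, B) in zip(lam, adapters))
    return W + (alpha / r) * delta


def scalarized_loss(task_losses, lam):
    """Weigh the task losses with the same preference vector."""
    return float(np.dot(np.asarray(lam, dtype=float), np.asarray(task_losses, dtype=float)))
```

With a one-hot preference the layer reduces to the base weight plus a single task's adapter; interior preferences interpolate between adapters, which is one way to read "parameterize the PF in their convex hull".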

Paper Structure (34 sections, 2 theorems, 14 equations, 15 figures, 5 tables, 1 algorithm)

This paper contains 34 sections, 2 theorems, 14 equations, 15 figures, 5 tables, 1 algorithm.

Key Result

Theorem 2

Let $f_t: \mathcal{X}\times\Theta\mapsto \mathcal{Y}$ be a family of continuous mappings, where $t=1,\dots,T$, and $\mathcal{X}\subset\mathbb{R}^D$ is compact. Then, $\forall \epsilon>0$, there exists a ReLU multi-layer perceptron $f$ with three different weight parameterizations $\bm{\theta}\ldots$

Figures (15)

  • Figure 1: Conceptual illustration of the architecture. Each layer consists of the base network's weight matrix $\bm{W}$ and $3$ low-rank adapters $\{(\bm{A}_t,\bm{B}_t)\}_{t=1}^3$. During training, we sample a preference $\bm{\lambda}=[\textcolor{blue}{\lambda_1},\textcolor{red}{\lambda_2},\textcolor{green}{\lambda_3}] \sim \Delta_3$, and each layer's weights are formed as the weighted sum of the original matrix and the tasks' low-rank adapters. The overall loss uses the same $\bm{\lambda}$ to weigh the task losses, steering each adapter to learn task-specific features and the shared backbone to learn a general representation.
  • Figure 2: Random vs. deterministic preference schedules for two tasks as a function of time. Each dashed line corresponds to a different batch; the bottom is the beginning of training and the top is the end. For each batch, preferences $\bm{\lambda}=[\lambda, 1-\lambda]$ are drawn; we only show $\lambda$. Randomly sampling (a) $M=1$ ray per batch or (b) multiple ($M>1$) rays (Dimitriadis et al., 2023) can lead to poor mappings from preference to objective space due to lack of exploration and tightly clustered sampled rays. Instead, (c) our proposed deterministic schedule resolves these issues, and (d) our temperature annealing, which focuses progressively more on learning task-specific features, can lead to wider Pareto Fronts.
  • Figure 3: Experimental results. (a) PaLoRA outperforms MTL baselines and achieves higher Hypervolume while requiring less memory than other PFL algorithms, and (b) constructs a wide Pareto Front for benchmarks with two classification tasks and one regression task. (c) Even for 7 tasks, PaLoRA converges quickly, while PaMaL is slow due to a $7\times$ increase in parameter count.
  • Figure 4: Pareto Front Expansion. Given a checkpoint $\bm{\theta}_0$, marked as $\bullet$, PaLoRA locally expands the Pareto Front in its neighborhood $\mathcal{N}(\bm{\theta}_0)$. (Left) The scaling $\alpha$ of the LoRA output equation determines the functional diversity of the final MultiMNIST Front. (Middle) The epoch-by-epoch progression of the MultiMNIST Pareto Front expansion. (Right) The final CityScapes Pareto Front.
  • Figure 5: PaLoRA satisfies both PFL goals: high controllability coupled with superior performance. PaLoRA converges faster than the state-of-the-art PFL method PaMaL and is on par with or better than MTL methods. It is also more consistent in terms of the validity of the Pareto Front across epochs, whereas the number of points in the Pareto Front of PaMaL varies considerably.
  • ...and 10 more figures
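
The Figure 2 caption contrasts random preference sampling with a deterministic schedule and temperature annealing. The sketch below shows one way such a schedule could look for two tasks: evenly spaced rays, sharpened toward the single-task corners as the temperature drops. The power-normalization transform and the name `deterministic_rays` are assumptions made for illustration; the summary above does not give the paper's exact formula.

```python
import numpy as np


def deterministic_rays(num_rays, temperature=1.0):
    """Evenly spaced two-task preferences with an assumed annealing transform.

    Returns an array of shape (num_rays, 2) whose rows are [lam, 1 - lam].
    temperature = 1 keeps the evenly spaced grid; temperature < 1 pushes
    rays toward the corners [0, 1] and [1, 0] (more task-specific focus);
    temperature > 1 pulls them toward the center [0.5, 0.5].
    """
    if num_rays < 2:
        raise ValueError("num_rays must be at least 2")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    base = np.linspace(0.0, 1.0, num_rays)
    # Hypothetical annealing: raise to the power 1 / temperature, then renormalize.
    p = base ** (1.0 / temperature)
    q = (1.0 - base) ** (1.0 / temperature)
    lam = p / (p + q)
    return np.stack([lam, 1.0 - lam], axis=1)
```

Because the rays are fixed per batch rather than drawn at random, every batch covers the whole preference interval, which avoids the tightly clustered samples the caption describes for the random schedules.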

Theorems & Definitions (4)

  • Definition 1: Pareto Optimality
  • Theorem 2
  • Theorem 3
  • proof
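
Definition 1 is listed above by name only. As a reminder of the standard notion, the sketch below checks Pareto dominance and filters a set of loss vectors down to its non-dominated points. It assumes every objective is a loss to be minimized, and the helper names are hypothetical.

```python
import numpy as np


def dominates(a, b):
    """True if loss vector a is no worse than b everywhere and strictly better somewhere."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))


def pareto_front(losses):
    """Return the rows of `losses` that no other row dominates (minimization)."""
    losses = np.asarray(losses, dtype=float)
    keep = [
        i
        for i, p in enumerate(losses)
        if not any(dominates(q, p) for j, q in enumerate(losses) if j != i)
    ]
    return losses[keep]
```

Applied to the per-task losses obtained by sweeping the preference vector, a filter of this kind indicates how many of the sampled operating points actually lie on the front.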