Elastic Multi-Gradient Descent for Parallel Continual Learning

Fan Lyu; Wei Feng; Yuepan Li; Qing Sun; Fanhua Shang; Liang Wan; Liang Wang

TL;DR

This paper introduces Parallel Continual Learning (PCL), where multiple data streams with distinct tasks arrive asynchronously and are trained in parallel. It reframes PCL as a dynamic multi-objective optimization problem and proposes Elastic Multi-Gradient Descent (EMGD), which uses task-specific elastic factors $\sigma_i$ to enforce Pareto descent directions, so that each update supports both new-task learning and old-task retention. A gradient-guided memory editing mechanism further reduces interference by steering replay samples toward the EMGD descent direction. Empirical results on PS-EMNIST, PS-CIFAR-100, and PS-ImageNet-TINY show that EMGD improves the final average accuracy $A_{\bar{e}}$ and reduces forgetting $F_{\bar{e}}$ compared with MTL, SCL, and prior PCL baselines.
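As a rough illustration of the two reported quantities, the sketch below computes a final average accuracy and an average forgetting score from an accuracy matrix. It follows a convention common in the continual-learning literature and is an assumption here: this summary does not give the paper's exact definitions of $A_{\bar{e}}$ and $F_{\bar{e}}$. The function names and the layout of `acc` (rows are evaluation times, columns are tasks) are hypothetical.

```python
import numpy as np

def final_average_accuracy(acc):
    """Mean accuracy over all tasks at the last evaluation time.

    acc[t, i] is assumed to hold the accuracy on task i measured at
    evaluation time t (hypothetical layout, not taken from the paper).
    """
    acc = np.asarray(acc, dtype=float)
    return float(acc[-1].mean())

def average_forgetting(acc):
    """Mean drop from each task's best accuracy to its final accuracy.

    The maximum is taken over all evaluation times, including the last,
    so every per-task value is non-negative (an assumed convention).
    """
    acc = np.asarray(acc, dtype=float)
    best = acc.max(axis=0)   # best accuracy ever reached per task
    final = acc[-1]          # accuracy at the last evaluation time
    return float((best - final).mean())

# Toy example: two tasks evaluated at two time points.
acc = [[0.9, 0.1],
       [0.8, 0.7]]
print(final_average_accuracy(acc))
print(average_forgetting(acc))
```

In the toy matrix, the first task drops from 0.9 to 0.8 while the second task ends at its best value, so only the first task contributes to the forgetting score.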

Abstract

The goal of Continual Learning (CL) is to continuously learn from new data streams and accomplish the corresponding tasks. Previously studied CL assumes that the data for different tasks arrive in sequence, one task after another, and therefore falls under Serial Continual Learning (SCL). This paper studies the novel paradigm of Parallel Continual Learning (PCL) in dynamic multi-task scenarios, where a diverse set of tasks is encountered at different time points. PCL is challenging because an unspecified number of tasks with varying learning progress must be trained together, which makes it difficult to guarantee effective model updates for all encountered tasks. In our previous conference work, we focused on measuring and reducing the discrepancy among gradients in a multi-objective optimization problem; however, every model update may still contain negative transfer. To address this issue, we introduce task-specific elastic factors into the dynamic multi-objective optimization problem to adjust the descent direction towards the Pareto front. The proposed method, called Elastic Multi-Gradient Descent (EMGD), ensures that each update follows an appropriate Pareto descent direction, minimizing any negative impact on previously learned tasks. To balance the training between old and new tasks, we also propose a memory editing mechanism guided by the gradient computed using EMGD. This editing process updates the stored data points, reducing interference in the Pareto descent direction from previous tasks. Experiments on public datasets validate the effectiveness of EMGD in the PCL setting.
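The notion of a Pareto descent direction can be made concrete with the two-gradient min-norm problem that underlies MGDA, the baseline EMGD is compared with. The sketch below is not EMGD: the elastic factors $\sigma_i$ and the constraint they enter are not specified in this summary, so they are omitted. The function name `min_norm_two_task` and the example gradients are assumptions made for illustration.

```python
import numpy as np

def min_norm_two_task(g1, g2):
    """Min-norm convex combination of two task gradients (MGDA-style).

    Solves min over lam in [0, 1] of ||lam * g1 + (1 - lam) * g2||^2 in
    closed form. The model update would then move along -d.
    This is the plain min-norm baseline, without elastic factors.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        lam = 0.5  # identical gradients: every weight gives the same d
    else:
        # Unconstrained minimizer, clipped back onto the simplex [0, 1].
        lam = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    d = lam * g1 + (1.0 - lam) * g2
    return lam, d

# Orthogonal gradients: the combined direction helps both tasks.
lam, d = min_norm_two_task([1.0, 0.0], [0.0, 1.0])
print(lam, d, d @ np.array([1.0, 0.0]), d @ np.array([0.0, 1.0]))

# Exactly opposing gradients: no common descent direction exists, d is zero.
lam, d = min_norm_two_task([1.0, 0.0], [-1.0, 0.0])
print(lam, d)
```

When `d` has a positive inner product with both gradients, stepping along `-d` decreases both losses to first order. A zero `d` indicates a Pareto critical point for the two tasks, which is the situation addressed by case (1) of Lemma 1.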

Paper Structure (29 sections, 2 theorems, 27 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Multi-Task Learning
Continual Learning
Parallel Continual Learning
Problem Definition: Parallel Continual Learning
Dynamic multi-task learning with rehearsal
Elastic Multi-Gradient Descent
Elastic constraint for steepest gradient method
Evaluating elastic factor
Gradient-guided Memory Editing
Convergence analysis
Discussion
The limitation of MGDA
Comparing MGDA with EMGD
...and 14 more sections

Key Result

Lemma 1

Let $\mathbf{d}^*(\bm{\theta})$ and $\alpha^*(\bm{\theta})$ be the solution of Eq. (eq:edga) under the parameter $\bm{\theta}$. Then $\mathbf{d}^*(\bm{\theta})$ and $\alpha^*(\bm{\theta})$ have the following properties: (1) if $\bm{\theta}$ is Pareto critical, then $\mathbf{d}^*(\bm{\theta})=\mathbf{0}$ and $\alpha^*$ ...

Figures (9)

Figure 1: Comparison of Multi-Task Learning (MTL), Serial Continual Learning (SCL), and Parallel Continual Learning (PCL). (a) MTL relies on a fixed number of different tasks and cannot adaptively incorporate new tasks. (b) SCL learns multiple tasks sequentially, so new tasks have to wait for the completion of previous training. (c) PCL trains multiple tasks based on their access time, allowing new tasks to be incorporated immediately.
Figure 2: Two synthetic functions are optimized in the PCL setting, where the new task becomes available at iteration 500.
Figure 3: The schematic of the proposed method (one-step). Together with the memory stream, each activated data stream is put into the model and used to compute the task-specific gradient. These gradients are sent to EMGD to obtain a Pareto descent direction. Then, we use the optimal gradient from EMGD to guide the editing on the memory.
Figure 4: Compared to MGDA (Sener and Koltun, 2018), EMGD can control the ratio of $\lambda_2$ to $\lambda_1$ in the two-task scenario. The ratio depends strongly on the value of $\sigma$.
Figure 5: Performance comparisons with different memory size on PS-CIFAR-100.
...and 4 more figures

Theorems & Definitions (4)

Definition 1: Parallel Continual Learning
Definition 2: Pareto Optimality
Lemma 1
Theorem 1
