Disentangling and Mitigating the Impact of Task Similarity for Continual Learning

Naoki Hiratani

TL;DR

The paper addresses how task similarity in input features and readout patterns governs transfer and forgetting in continual learning. It develops a linear teacher–student model with a low-dimensional latent space to analytically disentangle feature-readout effects and to study gating and Fisher-information-based regularization as mitigation strategies. Key findings show that high feature similarity with low readout similarity can cause negative transfer and poor retention, while weight regularization in the Fisher metric provides robust retention without harming transfer; adaptive gating can further improve transfer. Numerical experiments on a latent-permuted MNIST setting corroborate the theory and offer guidance on when continual learning is difficult and how to mitigate it.

Abstract

Continual learning of partially similar tasks poses a challenge for artificial neural networks, as task similarity presents both an opportunity for knowledge transfer and a risk of interference and catastrophic forgetting. However, it remains unclear how task similarity in input features and readout patterns influences knowledge transfer and forgetting, as well as how they interact with common algorithms for continual learning. Here, we develop a linear teacher-student model with latent structure and show analytically that high input feature similarity coupled with low readout similarity is catastrophic for both knowledge transfer and retention. Conversely, the opposite scenario is relatively benign. Our analysis further reveals that task-dependent activity gating improves knowledge retention at the expense of transfer, while task-dependent plasticity gating does not affect either retention or transfer performance at the over-parameterized limit. In contrast, weight regularization based on the Fisher information metric significantly improves retention, regardless of task similarity, without compromising transfer performance. Nevertheless, its diagonal approximation and regularization in the Euclidean space are much less robust against task similarity. We demonstrate consistent results in a permuted MNIST task with latent variables. Overall, this work provides insights into when continual learning is difficult and how to mitigate it.
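The central claim of the abstract can be probed with a toy simulation. The sketch below is a minimal illustration, not the paper's implementation. It assumes Gaussian latent variables $s$, inputs $x = A s$, targets $y^* = B s$, a linear student $y = W x$ trained by gradient descent from zero weights, and task pairs whose mixing matrices have element-wise correlation $\rho_a$ (features) and $\rho_b$ (readouts). Transfer and retention are taken, also as an assumption, to be the reduction in error relative to the untrained student. All function names, dimensions, and the learning rate are hypothetical.

```python
import numpy as np


def correlated_pair(shape, rho, scale, rng):
    """Draw two Gaussian matrices whose matching entries have correlation rho."""
    first = rng.standard_normal(shape)
    second = rho * first + np.sqrt(1.0 - rho**2) * rng.standard_normal(shape)
    return scale * first, scale * second


def task_error(W, A, B):
    """Expected squared error per output unit for s ~ N(0, I), x = A s, y* = B s."""
    return np.sum((B - W @ A) ** 2) / B.shape[0]


def train(W, A, B, lr=0.5, steps=100):
    """Plain gradient descent on 0.5 * ||B - W A||_F^2 for a single task."""
    for _ in range(steps):
        W = W - lr * (W @ A - B) @ A.T
    return W


def transfer_and_retention(rho_a, rho_b, n_s=30, n_x=300, n_y=10, seed=0):
    """Train on task 1, then task 2; return (transfer, retention).

    Assumed definitions (relative to the zero-weight student W0):
      transfer  = error_2(W0) - error_2(W1)
      retention = error_1(W0) - error_1(W2)
    """
    rng = np.random.default_rng(seed)
    # Feature (A) and readout (B) matrices for the two tasks.
    A1, A2 = correlated_pair((n_x, n_s), rho_a, 1.0 / np.sqrt(n_x), rng)
    B1, B2 = correlated_pair((n_y, n_s), rho_b, 1.0 / np.sqrt(n_s), rng)

    W0 = np.zeros((n_y, n_x))
    W1 = train(W0, A1, B1)  # after task 1
    W2 = train(W1, A2, B2)  # after task 2, no mitigation ("vanilla")

    transfer = task_error(W0, A2, B2) - task_error(W1, A2, B2)
    retention = task_error(W0, A1, B1) - task_error(W2, A1, B1)
    return transfer, retention


if __name__ == "__main__":
    for rho_a, rho_b in [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]:
        tf, rt = transfer_and_retention(rho_a, rho_b)
        print(f"rho_a={rho_a:.1f} rho_b={rho_b:.1f} transfer={tf:+.2f} retention={rt:+.2f}")
```

Under these assumptions, identical tasks ($\rho_a = \rho_b = 1$) should give positive transfer and retention, whereas identical features with independent readouts ($\rho_a = 1$, $\rho_b = 0$) should make both negative, in line with the scenario the abstract calls catastrophic.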

Paper Structure (42 sections, 108 equations, 9 figures)

Figures (9)

  • Figure 1: Transfer and retention performance of the vanilla model. (A) Schematic of task similarity. (B) Illustration of $\Delta \epsilon_{TF}$ and $\Delta \epsilon_{RT}$. Red and blue lines represent the error on task 1 and task 2, respectively. Here, the model was trained on task 1 for 100 iterations and then trained on task 2 for another 100 iterations. (C, D) Transfer performance under various task similarities. Points in panel C are numerical results (the means and the standard deviations over ten random seeds), while solid lines are analytical results (Eq. \ref{eq_epsilon_vanilla}). (E-H) Retention performance under various task similarities. Panel H magnifies the $0.9 \leq \rho_b \leq 1.0$ region of panel F, and the white dashed line in panel H represents local minima/maxima.
  • Figure 2: Random task-dependent activity gating model. (A) Knowledge transfer performance under $\rho_a = 1.0$. The gating level $\alpha$ is defined as the fraction of active input neurons (i.e., $\alpha = \Pr [g_i = 1]$). (B) The transfer performance under the optimal gating level $\alpha^* = \min \{\tfrac{\rho_b}{\rho_a}, 1\}$. (C) Retention performance under $\rho_a = 1.0$. (D) Average transfer and retention performance over uniform prior on $0 \leq \rho_a, \rho_b \leq 1$. Horizontal dashed lines are the performance of the vanilla model, while solid lines are the performance of the random gating model. Points are numerical estimations.
  • Figure 3: Transfer and retention performance of the adaptive activity gating (A, B), random plasticity gating (C), and input soft-thresholding (D) models. Dashed and solid lines in panels A and B represent the performance of the random and adaptive activity gating models, respectively.
  • Figure 4: Performance of weight regularization in the Euclidean metric. (A,B) Transfer (A) and retention (B) performance. The amplitude of the weight regularization scales with $\tfrac{1}{\gamma} - 1$. (C) Regularizer coefficient $\gamma$ that optimizes the retention performance. (D) Average performance over a uniform task-similarity distribution on $0 \leq \rho_a, \rho_b \leq 1$. Horizontal dashed lines are the performance of the vanilla model.
  • Figure 5: Weight regularization in the Fisher information metric. (A,B) The retention performance under various task similarities and regularizer coefficients. (C,D) Average transfer and retention performance under the regularization with the exact Fisher information metric (C) and its diagonal approximation (D).
  • ...and 4 more figures
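
The gating quantities in the Figure 2 caption can be written out directly. The sketch below is a hedged illustration: it computes the optimal gating level $\alpha^* = \min\{\tfrac{\rho_b}{\rho_a}, 1\}$ and samples a random binary gate with $\Pr[g_i = 1] = \alpha$. The function names are hypothetical, and the handling of $\rho_a = 0$ (returning 1) and the element-wise application of the gate to the input are assumptions, not details taken from the caption.

```python
import numpy as np


def optimal_gating_level(rho_a, rho_b):
    """alpha* = min(rho_b / rho_a, 1), following the Figure 2 caption."""
    if rho_a <= 0.0:
        # Assumption: with no feature similarity the ratio is unbounded,
        # so the minimum is taken to be 1 (no gating).
        return 1.0
    return min(rho_b / rho_a, 1.0)


def sample_gate(n_inputs, alpha, rng):
    """Binary mask g over input neurons with Pr[g_i = 1] = alpha."""
    return (rng.random(n_inputs) < alpha).astype(float)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha = optimal_gating_level(rho_a=1.0, rho_b=0.5)
    # Assumption: one independent mask per task, applied element-wise.
    gate_task_1 = sample_gate(300, alpha, rng)
    gate_task_2 = sample_gate(300, alpha, rng)
    x = rng.standard_normal(300)
    x_gated_task_1 = gate_task_1 * x
    print(alpha, gate_task_1.mean(), gate_task_2.mean())
```

Drawing a separate mask for each task is what makes the gating task-dependent in this sketch: the smaller $\alpha$ is, the less two tasks' active input sets overlap.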