Data-Distill-Net: A Data Distillation Approach Tailored for Reply-based Continual Learning

Wenyang Liao, Quanziang Wang, Yichen Wu, Renzhen Wang, Deyu Meng

TL;DR

This work targets catastrophic forgetting in replay-based continual learning by introducing a data distillation framework that distills cross-task information into a learnable memory buffer. It introduces Data-Distill-Net (DDN), a lightweight hyper-network that generates soft labels for buffer samples, enabling global information distillation with reduced parameterization and avoiding heavy updates to the entire buffer. The approach uses a bi-level optimization to align gradients between the current and past data, and provides theoretical connections showing equivalence to gradient matching. Empirically, DDN improves average accuracy and reduces forgetting across online and offline settings when plugged into multiple replay-based baselines and across standard CL benchmarks, with strong performance especially at tight memory budgets. The method offers practical efficiency and plug-in compatibility, promoting more robust continual learning in resource-constrained environments.
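To make "aligning gradients between the current and past data" concrete, here is a minimal sketch of one common way to measure gradient agreement: the cosine similarity between the mean gradient on a small buffer and the mean gradient on a larger reference set. It assumes a toy linear softmax classifier and made-up data; the helper names (`linear_grad`, `cosine`) are hypothetical, and this is not the paper's bi-level procedure.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_grad(weights, samples):
    """Mean cross-entropy gradient w.r.t. the weights of a linear classifier,
    flattened row by row. `samples` is a list of (features, target_distribution)."""
    rows, cols = len(weights), len(weights[0])
    grad = [0.0] * (rows * cols)
    for x, target in samples:
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        probs = softmax(logits)
        for k in range(rows):
            err = probs[k] - target[k]  # d(loss)/d(logit_k) for softmax + cross-entropy
            for j in range(cols):
                grad[k * cols + j] += err * x[j] / len(samples)
    return grad

def cosine(u, v):
    """Cosine similarity between two flat gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

# ASSUMPTION: toy 2-class, 2-feature problem; all values are illustrative only.
weights = [[0.0, 0.0], [0.0, 0.0]]
reference = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]  # "full" data
buffer = [([1.0, -1.0], [0.8, 0.2])]                              # one soft-labelled sample

alignment = cosine(linear_grad(weights, buffer), linear_grad(weights, reference))
```

In this toy case a single soft-labelled buffer sample yields a gradient pointing in the same direction as the two-sample reference set, which is the kind of agreement a gradient-matching objective rewards.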

Abstract

Replay-based continual learning (CL) methods assume that models trained on a small subset can also effectively minimize the empirical risk of the complete dataset. These methods maintain a memory buffer that stores a sampled subset of data from previous tasks to consolidate past knowledge. However, this assumption is not guaranteed in practice due to the limited capacity of the memory buffer and the heuristic criteria used for buffer data selection. To address this issue, we propose a new dataset distillation framework tailored for CL, which maintains a learnable memory buffer to distill the global information from the current task data and accumulated knowledge preserved in the previous memory buffer. Moreover, to avoid the computational overhead and overfitting risks associated with parameterizing the entire buffer during distillation, we introduce a lightweight distillation module that can achieve global information distillation solely by generating learnable soft labels for the memory buffer data. Extensive experiments show that our method achieves competitive results and effectively mitigates forgetting across various datasets. The source code will be publicly available.
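As a rough illustration of replay with soft labels on buffer data, the sketch below combines a hard-label cross-entropy on current-task samples with a soft-label cross-entropy on buffer samples. It assumes the soft labels arrive as label logits from some generator (in the paper this role is played by the distillation module, whose architecture is not reproduced here); the function names and all numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(target, probs) if q > 0)

def one_hot(index, num_classes):
    """Hard label as a degenerate distribution."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]

def replay_loss(current, buffer, num_classes):
    """Average loss over current-task samples (hard labels) and
    buffer samples (soft labels).

    current: list of (model_logits, class_index)
    buffer:  list of (model_logits, label_logits)
    """
    losses = [soft_cross_entropy(z, one_hot(y, num_classes)) for z, y in current]
    # ASSUMPTION: label_logits come from a soft-label generator; softmax
    # turns them into a valid probability distribution over classes.
    losses += [soft_cross_entropy(z, softmax(s)) for z, s in buffer]
    return sum(losses) / len(losses)

# Toy usage with 3 classes; every value here is made up for illustration.
current = [([2.0, 0.5, -1.0], 0)]
buffer = [([0.1, 1.5, 0.3], [0.0, 2.0, 1.0])]
loss = replay_loss(current, buffer, num_classes=3)
```

Only the label logits would need to be learned in such a setup, which is far fewer parameters than learning the buffer images themselves.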

Paper Structure

This paper contains 27 sections, 2 theorems, 23 equations, 6 figures, 5 tables, 2 algorithms.

Key Result

Theorem 5.1

Let $\nabla_\theta \mathcal{L}(f_\theta; \mathcal{M}^n)$ be the gradients w.r.t. $\theta$ on the parameterized memory buffer $\mathcal{M}^n$, and $\nabla_\theta \mathcal{L}(f_\theta; \mathcal{M}^{n-1} \cup \mathcal{T}^n)$ be the gradients w.r.t. $\theta$ on the current task $\mathcal{T}^n$ and the p

Figures (6)

  • Figure 1: Comparison between traditional dataset distillation (a) and our proposed method (b). At the $n$-th task $\mathcal{T}^n$, the distilled data $\mathcal{E}^n$ in the traditional approach contains only task-specific information from $\mathcal{T}^n$, so the samples in $\mathcal{M}^n$ fail to capture inter-task relationships, as illustrated in (a). In contrast, our method distills samples for the memory buffer $\mathcal{M}^n$ from both the current task $\mathcal{T}^n$ and the previous buffer $\mathcal{M}^{n-1}$, thereby preserving correlations across tasks, as shown in (b).
  • Figure 2: Overall framework of the proposed DDN. By generating refined soft labels for buffer samples, DDN enhances the classifier's training, thereby mitigating catastrophic forgetting of prior task knowledge. The training algorithms for the classifier and DDN are detailed in Algorithm \ref{alg:whole process} and Algorithm \ref{alg:slg}, respectively.
  • Figure 3: Under online CL, comparison of the predicted probability distributions for different class samples between ER and ER DDN (ours) on Split CIFAR-10 with buffer size M = 0.2K.
  • Figure 4: Under online CL, ACC and FM of different $\alpha$ and $\beta$ of our method on Split CIFAR-10 with buffer size M = 0.2K. (a) Effect of $\alpha$ with fixed $\beta=0.9$. (b) Effect of $\beta$ with fixed $\alpha=1.0$.
  • Figure 5: The normalized confusion matrix of ER and ER DDN based on Split CIFAR-10 with buffer size M=0.2K.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 5.1
  • Theorem 5.2
  • proof
  • proof