
HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning

Xuerui Zhang, Xuehao Wang, Zhan Zhuang, Linglan Zhao, Ziyue Li, Xinmin Zhang, Zhihuan Song, Yu Zhang

Abstract

Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (e.g., only classification tasks) and neglects learning across heterogeneous tasks that possess different output structures. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL). Departing from conventional lifelong learning, the task sequence in LHL spans different task types, so the learner must retain heterogeneous knowledge across different output-space structures. To instantiate LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario. To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously acquired heterogeneous knowledge by self-distillation in each training phase. HAD comprises two complementary components: a distribution-balanced heterogeneity-aware distillation loss that alleviates the global imbalance of the prediction distribution, and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator. Extensive experiments demonstrate that HAD significantly outperforms existing methods in this new scenario.
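The salience-guided loss relies on the Sobel operator to locate informative edge pixels in a prediction map. The following sketch illustrates that idea only; it is not the authors' implementation, and the function name `sobel_salience_mask` and the `top_frac` hyperparameter are assumptions for illustration.

```python
import numpy as np

def sobel_salience_mask(pred, top_frac=0.2):
    """Return a binary mask keeping the `top_frac` most edge-like pixels
    of a 2-D prediction map, ranked by Sobel gradient magnitude.
    Illustrative sketch of salience-guided weighting, not the paper's code."""
    # Standard 3x3 Sobel kernels for horizontal and vertical gradients.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    # Replicate-pad so the gradient maps keep the input shape.
    p = np.pad(pred.astype(float), 1, mode="edge")
    H, W = pred.shape
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            patch = p[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    # Keep pixels whose gradient magnitude falls in the top fraction.
    thresh = np.quantile(mag, 1.0 - top_frac)
    return mag >= thresh
```

In a distillation loss, such a mask could up-weight the per-pixel teacher-student discrepancy at edge pixels, concentrating the preserved knowledge where predictions change most rapidly.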


Paper Structure

This paper contains 26 sections, 11 equations, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: Vanilla training under LHL4DP. To assess the impact of catastrophic forgetting, we shuffle the learning sequences of three DP tasks. Each figure illustrates how the performance of a given task varies as the training phase proceeds, where the number on the horizontal axis denotes the task index in each sequence of three DP tasks. The performance metric is indicated above each column. The symbol $\uparrow$ ($\downarrow$) signifies that a higher (lower) value denotes better performance.
  • Figure 2: The training pipeline of the proposed HAD method in the $t$-th training phase. The HAD method uses the distribution-balanced and salience-guided distillation loss to mitigate forgetting of previous tasks $\mathcal{T}_j\ (j<t)$, all of which are calculated on the pseudo-labels generated by the frozen teacher model $\mathcal{F}^{t-1}_{j}$. Adapting to the new task $\mathcal{T}_t$ is achieved by the task-specific loss function $\mathcal{L}_{\mathrm{new}}$.
  • Figure 3: An illustration of the distribution imbalance in pseudo-labels. For the semantic segmentation task, the number of pixels is counted per class. For the depth estimation task, we divide the range of pseudo-labels produced by the teacher model into ten equal intervals (groups) and sort the groups by the number of pixels in each.
  • Figure 4: The comparison between training separate models and the proposed HAD. Each figure illustrates the performance improvement of the HAD method for a given task in the LHL4DP scenario. The symbol $\uparrow$ ($\downarrow$) signifies that a higher (lower) value denotes better performance.
  • Figure 5: Visualization of the raw data (left), the gradient magnitude map of its prediction (middle), and the gradient magnitude map of the loss map (right).
  • ...and 2 more figures