Step Out and Seek Around: On Warm-Start Training with Incremental Data

Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jose M. Alvarez

TL;DR

The paper addresses learning from streaming data under fixed training and storage costs, a setting in which warm-starting from a prior model can hurt generalization. It introduces CKCA, a continuous-learning framework that combines feature-space regularization (FeatReg) with adaptive knowledge distillation (AdaKD) to mitigate forgetting while still learning from new data. On ImageNet with 10 data splits and no access to old data, CKCA improves top-1 accuracy by up to 8.39 percentage points over vanilla warm-starting and by about 6.24 points over the prior state of the art, with further gains when combined with iCaRL. The improvements hold across data-incremental settings, and the method is compatible with existing continual-learning techniques, which makes it practical for industrial scenarios.

Abstract

Data often arrives in sequence over time in real-world deep learning applications such as autonomous driving. When new training data becomes available, training the model from scratch discards the knowledge already learned and incurs significant training costs. Warm-starting from a previously trained checkpoint is the most intuitive way to retain knowledge and advance learning. However, existing literature suggests that warm-starting degrades generalization. In this paper, we advocate for warm-starting but stepping out of the previous convergence point, thus allowing better adaptation to new data without compromising previous knowledge. We propose Knowledge Consolidation and Acquisition (CKCA), a continuous model improvement algorithm with two novel components. First, we introduce a feature regularization (FeatReg) that retains and refines knowledge from existing checkpoints. Second, we propose adaptive knowledge distillation (AdaKD), a novel approach to mitigating forgetting and transferring knowledge. We tested our method on ImageNet using multiple splits of the training data. Our approach achieves up to $8.39\%$ higher top-1 accuracy than vanilla warm-starting and consistently outperforms the prior art by a large margin.
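
To make the cost argument concrete, the sketch below compares two strategies as data splits arrive: retraining from scratch on all data seen so far, and warm-starting on the newest split only. The unit of cost (one pass over one split per round), the equal split sizes, and the function name `accumulated_costs` are assumptions made for illustration; they are not taken from the paper.

```python
def accumulated_costs(num_splits, split_size=1):
    """Tabulate illustrative training and storage costs per round.

    Assumption: training cost in a round is proportional to the amount
    of data trained on in that round, and all splits have equal size.
    Returns tuples of
    (round, scratch_train_total, scratch_storage, fixed_train_total, fixed_storage).
    """
    scratch_train_total = 0
    fixed_train_total = 0
    rows = []
    for k in range(1, num_splits + 1):
        # From scratch: retrain on all k splits, so all k must be stored.
        scratch_train_total += k * split_size
        scratch_storage = k * split_size
        # Fixed budget: warm-start and train on the newest split only.
        fixed_train_total += split_size
        fixed_storage = split_size
        rows.append((k, scratch_train_total, scratch_storage,
                     fixed_train_total, fixed_storage))
    return rows


if __name__ == "__main__":
    for row in accumulated_costs(10):
        print(row)
```

Under these assumptions, the accumulated training cost of retraining from scratch grows quadratically with the number of splits, while the fixed-budget strategy grows linearly and stores only one split at a time.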

Paper Structure

This paper contains 11 sections, 7 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Accumulated training and storage costs as a function of the training strategy when training data becomes progressively available. Warm start involves using previously trained models and therefore increases the training costs. The goal here is to keep training and storage costs fixed, that is, to train a new model using existing models as initialization (warm-start) while accessing only the new data.
  • Figure 2: Accuracy improvement brought by FeatReg and AdaKD depending on the amount of data available.
  • Figure 3: Knowledge distillation: comparison between the proposed adaptive distillation and other approaches. In the early stages of training, all methods perform similarly. In the later stages, however, AdaKD clearly outperforms the other approaches because it reduces the influence of the teacher (the checkpoint) once the student starts outperforming it.
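
The Figure 3 caption describes the behavior of adaptive distillation without giving its formula here. The following is a minimal sketch of one possible way to down-weight the teacher once the student does better, using NumPy only. The weighting rule (the student's share of the combined cross-entropy), the function names, and the temperature value are all hypothetical choices for illustration and should not be read as the paper's AdaKD definition.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with a temperature, computed stably."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def adaptive_kd_loss(student_logits, teacher_logits, labels, temperature=2.0):
    """Cross-entropy plus a per-sample weighted distillation term.

    Hypothetical rule: the distillation weight is the student's share of
    the combined student and teacher cross-entropy, so the teacher
    matters less on samples where the student is already better.
    """
    eps = 1e-12
    idx = np.arange(labels.shape[0])

    # Per-sample cross-entropy of student and teacher at temperature 1.
    ce_student = -np.log(softmax(student_logits)[idx, labels] + eps)
    ce_teacher = -np.log(softmax(teacher_logits)[idx, labels] + eps)

    # Per-sample KL(teacher || student) on temperature-softened outputs.
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=1)

    # Weight near 1 when the teacher is much better, near 0 when the
    # student is much better (illustrative assumption, see lead-in).
    weight = ce_student / (ce_student + ce_teacher + eps)

    loss = (ce_student + weight * kl).mean()
    return loss, weight


if __name__ == "__main__":
    labels = np.array([0, 0])
    student = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
    teacher = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    loss, weight = adaptive_kd_loss(student, teacher, labels)
    print(loss, weight)
```

In the usage example, the teacher is confident and correct on the first sample, so its weight is close to 1; on the second sample the student is the better model, so the distillation term is almost switched off. In an autodiff framework, one would likely treat the weight as a constant rather than differentiate through it.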