Step-size Optimization for Continual Learning

Thomas Degris, Khurram Javed, Arsalan Sharifnassab, Yuxin Liu, Richard Sutton

TL;DR

The paper investigates how step-size design affects continual learning, contrasting per-parameter optimization with gradient-normalization approaches. It shows that normalization-based methods like RMSProp/Adam can fail to align per-dimension updates with the true objective, while the IDBD meta-gradient approach directly optimizes step-sizes $\boldsymbol{\alpha}_t$ w.r.t. the lifetime loss $J_t$, yielding better adaptation on simple problems. The work highlights limitations of IDBD, notably sensitivity to the meta-step-size, and surveys extensions and related methodologies, arguing that combining normalization and meta-gradient optimization could yield robust, scalable step-size adaptation. This synthesis points to a fertile direction for improving neural networks in continual learning and possibly broader online learning tasks, potentially reducing manual hyperparameter tuning through principled normalization-assisted optimization.

Abstract

In continual learning, a learner has to keep learning from the data over its whole lifetime. A key issue is to decide what knowledge to keep and what knowledge to let go. In a neural network, this can be implemented by using a step-size vector to scale how much gradient samples change network weights. Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector. In this paper, we show that those heuristics ignore the effect of their adaptation on the overall objective function, for example by moving the step-size vector away from better step-size vectors. On the other hand, stochastic meta-gradient descent algorithms, like IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to the overall objective function. On simple problems, we show that IDBD is able to consistently improve step-size vectors, where RMSProp and Adam do not. We explain the differences between the two approaches and their respective limitations. We conclude by suggesting that combining both approaches could be a promising future direction to improve the performance of neural networks in continual learning.
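To make the idea of optimizing a step-size vector concrete, here is a minimal sketch of one IDBD update for a linear predictor, following the formulation in Sutton (1992). The function name `idbd_step`, its argument names, and the toy inputs are illustrative assumptions, not taken from the paper, and the paper's own experiments may use a different variant.

```python
import math


def idbd_step(w, beta, h, x, target, meta_step_size):
    """One IDBD update for a linear predictor (sketch after Sutton, 1992).

    w     : weights (list of floats, updated in place)
    beta  : log step-sizes, so alpha_i = exp(beta_i) (updated in place)
    h     : per-weight memory trace of recent updates (updated in place)
    Returns the prediction error computed before the update.
    """
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    delta = target - prediction
    for i, xi in enumerate(x):
        # Meta-gradient step: raise beta_i when the current error-gradient
        # correlates with the trace of past updates, lower it otherwise.
        beta[i] += meta_step_size * delta * xi * h[i]
        alpha = math.exp(beta[i])
        # Ordinary LMS weight update with the per-weight step-size.
        w[i] += alpha * delta * xi
        # Decay the trace and add the latest weight change.
        h[i] = h[i] * max(0.0, 1.0 - alpha * xi * xi) + alpha * delta * xi
    return delta


# Hypothetical toy usage: two inputs, only the first one is ever active.
w = [0.0, 0.0]
beta = [math.log(0.1), math.log(0.1)]
h = [0.0, 0.0]
for _ in range(2):
    idbd_step(w, beta, h, x=[1.0, 0.0], target=1.0, meta_step_size=0.01)
```

In this toy run the step-size of the first weight grows because consecutive errors point in the same direction, while the step-size of the inactive second weight is left unchanged. The `meta_step_size` argument corresponds to the meta-step-size whose tuning the TL;DR lists as a limitation.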

Paper Structure

This paper contains 8 sections, 4 equations, 4 figures, and 4 algorithms.

Figures (4)

  • Figure 1: With conventional step-size normalization methods like RMSProp, the step-sizes do not go towards the optimal step-sizes.
  • Figure 2: On the weight-flipping problem, IDBD performs as well as Oracle SGD and better than conventional methods.
  • Figure 3: On the noisy-tracking problem, step-size optimization (IDBD) can accurately track the optimal step-size on a non-stationary, single-dimension problem. Step-size normalization, as done by RMSProp, achieves exactly the opposite: it increases the step-size when it should be decreased, and vice versa.
  • Figure 4: Shift of the best meta step-size parameter of the IDBD algorithm.
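
The normalization that the captions for Figures 1 and 3 refer to can be illustrated with a short sketch of the standard RMSProp rule, in which the effective per-weight step-size is the base step-size divided by a running root-mean-square of recent gradients. The function name, default values, and gradient streams below are illustrative assumptions, not the paper's experimental setup.

```python
import math


def rmsprop_effective_step_size(gradients, base_step_size=0.01, decay=0.9, eps=1e-8):
    """Effective step-size of standard RMSProp after a stream of scalar gradients.

    RMSProp updates a weight by base_step_size * g / (sqrt(v) + eps), where v
    is an exponential moving average of squared gradients. The factor that
    multiplies g is returned here as the effective step-size.
    """
    v = 0.0
    for g in gradients:
        v = decay * v + (1.0 - decay) * g * g
    return base_step_size / (math.sqrt(v) + eps)


# Hypothetical gradient streams: one with small gradients, one with large ones.
small = rmsprop_effective_step_size([0.1] * 50)
large = rmsprop_effective_step_size([10.0] * 50)
print(small > large)
```

The effective step-size depends only on the recent gradient magnitude: small gradients give a larger step-size and large gradients a smaller one. Nothing in this rule measures whether the resulting step-size lowers the overall loss, which is the gap that the abstract says meta-gradient methods such as IDBD address.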