Transformers for Supervised Online Continual Learning

Jorg Bornschein, Yazhe Li, Amal Rannen-Triki

TL;DR

The approach explicitly conditions a transformer on recent observations while simultaneously training it online with stochastic gradient descent, following the procedure introduced with Transformer-XL. The authors hypothesize that this combination enables fast adaptation through in-context learning and sustained long-term improvement via parametric learning.

Abstract

Transformers have become the dominant architecture for sequence modeling tasks such as natural language processing or audio processing, and they are now even considered for tasks that are not naturally sequential, such as image classification. Their ability to attend to and process a set of tokens as context enables them to develop in-context few-shot learning abilities. However, their potential for online continual learning remains relatively unexplored. In online continual learning, a model must adapt to a non-stationary stream of data, minimizing the cumulative next-step prediction loss. We focus on the supervised online continual learning setting, where we learn a predictor $x_t \rightarrow y_t$ for a sequence of examples $(x_t, y_t)$. Inspired by the in-context learning capabilities of transformers and their connection to meta-learning, we propose a method that leverages these strengths for online continual learning. Our approach explicitly conditions a transformer on recent observations while simultaneously training it online with stochastic gradient descent, following the procedure introduced with Transformer-XL. We incorporate replay to maintain the benefits of multi-epoch training while adhering to the sequential protocol. We hypothesize that this combination enables fast adaptation through in-context learning and sustained long-term improvement via parametric learning. Our method demonstrates significant improvements over previous state-of-the-art results on CLOC, a challenging large-scale real-world benchmark for image geo-localization.
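
The predict-then-update protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: a linear softmax classifier stands in for the transformer, there is no in-context conditioning or replay, and the names `online_sgd_prequential` and `toy_stream` as well as the toy data are hypothetical. It only shows how the cumulative next-step prediction loss is accumulated on a non-stationary stream, with each prediction made before the label is revealed.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_stream(length=400, num_classes=4, seed=0):
    """Hypothetical non-stationary stream: one-hot inputs whose
    input-to-label mapping is reshuffled halfway through."""
    rng = random.Random(seed)
    mapping = list(range(num_classes))
    for t in range(length):
        if t == length // 2:
            rng.shuffle(mapping)  # distribution shift
        c = rng.randrange(num_classes)
        x = [1.0 if j == c else 0.0 for j in range(num_classes)]
        yield x, mapping[c]


def online_sgd_prequential(stream, num_features, num_classes, lr=0.1):
    """Return (cumulative log-loss, number of correct predictions)."""
    # Assumption: a linear softmax model replaces the transformer.
    weights = [[0.0] * num_features for _ in range(num_classes)]
    cumulative_loss, correct = 0.0, 0
    for x, y in stream:
        # 1) Predict y_t from x_t before the label is used for learning.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        probs = softmax(logits)
        cumulative_loss += -math.log(max(probs[y], 1e-12))
        prediction = max(range(num_classes), key=probs.__getitem__)
        correct += int(prediction == y)
        # 2) Reveal y_t and take a single SGD step on the log-loss.
        for k in range(num_classes):
            grad = probs[k] - (1.0 if k == y else 0.0)
            for j in range(num_features):
                weights[k][j] -= lr * grad * x[j]
    return cumulative_loss, correct


loss, correct = online_sgd_prequential(
    toy_stream(), num_features=4, num_classes=4, lr=0.5
)
print(f"cumulative log-loss: {loss:.2f}, correct: {correct}/400")
```

Because every example is scored before it is trained on, the cumulative loss measures how quickly the model adapts, which is the quantity the abstract says online continual learning aims to minimize.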

Paper Structure

This paper contains 31 sections, 3 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Instantaneous and averaged prediction performance for the first and last 10 tasks of Split-EMNIST. Image-to-label mappings are constant within each task, but randomly reassigned at task boundaries (averaged over 200 data-generating random seeds).
  • Figure 2: Alternative visualization of the Split-EMNIST experiments from Fig. 1. Left: average performance per task shows strong forward transfer after the model struggles during the first 10 to 20 tasks. Right: detailed look at the within-task performance for the scenario with 1000 examples per task. The model is a strong few-shot learner after seeing 30 tasks and continues to improve until at least task 100.
  • Figure 3: Average accuracy at the end of the sequence for different amounts of replay (epochs) and learning rates on Split-EMNIST (1000 examples per task, 100 tasks).
  • Figure 4: CLOC with pretrained and frozen feature extractors. We show the best-performing models (in terms of final average accuracy) from the hyper-parameter cube in Appendix \ref{app:cloc-features-hypercube}. We also show pi-Transformer ablations: either without input features $x_t$, or without attention ($C=0$).
  • Figure 5: Stopping gradient updates at various positions to investigate the performance of in-context conditioning alone, shown here for the pi-Transformer on CLOC with a fixed, pre-trained ResNet-50 feature extractor. Gradient-based online learning is most important at the beginning of the sequence, but it keeps contributing even after 10M examples.
  • ...and 10 more figures
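
The task structure described in the Figure 1 caption (image-to-label mappings that stay constant within a task and are randomly reassigned at task boundaries) can be sketched as follows. This is an illustrative assumption, not the paper's data pipeline: the function name `split_task_labels`, the uniform class sampling, and the use of integer class ids in place of images are all hypothetical.

```python
import random


def split_task_labels(class_ids, num_tasks, examples_per_task, seed=0):
    """Yield (task, true_class, presented_label) triples.

    Within a task the class-to-label mapping is fixed; a fresh random
    permutation is drawn at every task boundary.
    """
    rng = random.Random(seed)
    for task in range(num_tasks):
        # Draw a new random permutation of the labels for this task.
        shuffled = list(class_ids)
        rng.shuffle(shuffled)
        mapping = dict(zip(class_ids, shuffled))
        for _ in range(examples_per_task):
            # Assumption: classes are sampled uniformly at random.
            true_class = rng.choice(class_ids)
            yield task, true_class, mapping[true_class]


# Usage: 3 tasks over 5 classes, 4 examples per task.
for task, true_class, label in split_task_labels(list(range(5)), 3, 4):
    print(task, true_class, label)
```

A learner facing such a stream cannot rely on its parameters alone right after a boundary, since the old mapping is no longer valid; it has to infer the new mapping from the most recent examples.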