
Core Tokensets for Data-efficient Sequential Training of Transformers

Subarnaduti Paul, Manuel Brack, Patrick Schramowski, Kristian Kersting, Martin Mundt

TL;DR

This work proposes constructing a deeper-level data summary at the token level and demonstrates that core tokensets retain performance in incremental image classification, open-ended visual question answering, and continual image captioning while using significantly less memory.

Abstract

Deep networks are frequently tuned to novel tasks and continue learning from ongoing data streams. Such sequential training requires consolidation of new and past information, a challenge predominantly addressed by retaining the most important data points, formally known as coresets. Traditionally, these coresets consist of entire samples, such as images or sentences. However, recent transformer architectures operate on tokens, leading to the famous assertion that an image is worth 16x16 words. Intuitively, not all of these tokens are equally informative or memorable. Going beyond coresets, we thus propose to construct a deeper-level data summary on the level of tokens. Our accordingly named core tokensets both select the most informative data points and leverage feature attribution to store only their most relevant features. We demonstrate that core tokensets yield significant performance retention in incremental image classification, open-ended visual question answering, and continual image captioning with significantly reduced memory. In fact, we empirically find that a core tokenset of 1% of the data performs comparably to a coreset that is at least twice and up to 10 times as large.
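
The two-level selection described in the abstract (first pick the most informative data points, then keep only their most relevant tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the two fraction parameters, and the assumption that per-sample importance scores and per-token attribution scores are already available are all hypothetical choices made for illustration.

```python
import math


def build_core_tokenset(sample_scores, token_attributions,
                        sample_fraction, token_fraction):
    """Return {sample_index: sorted indices of the tokens to store}.

    All inputs are illustrative assumptions, not the paper's API:
    sample_scores      -- one importance score per sample, e.g. produced
                          by some coreset selection method.
    token_attributions -- per sample, one relevance score per token,
                          e.g. produced by some feature attribution method.
    sample_fraction    -- fraction of samples to keep (0 < f <= 1).
    token_fraction     -- fraction of tokens to keep per sample.
    """
    n_samples = len(sample_scores)
    n_keep = max(1, math.ceil(sample_fraction * n_samples))

    # Step 1 (coreset level): keep the highest-scoring samples.
    ranked = sorted(range(n_samples),
                    key=lambda i: sample_scores[i], reverse=True)
    core_samples = ranked[:n_keep]

    core_tokenset = {}
    for i in core_samples:
        scores = token_attributions[i]
        k = max(1, math.ceil(token_fraction * len(scores)))
        # Step 2 (token level): keep the k tokens with highest attribution.
        top = sorted(range(len(scores)),
                     key=lambda t: scores[t], reverse=True)[:k]
        # Store indices in positional order so position information
        # can be restored when the tokens are replayed.
        core_tokenset[i] = sorted(top)
    return core_tokenset
```

Storing token indices rather than only token values is a design choice in this sketch; it keeps the position of each retained token recoverable when the partial input is later fed to the model.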

Paper Structure

This paper contains 20 sections, 9 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Continued training with core tokensets. In sequential training, forgetting previous tasks is mitigated by selecting the most meaningful previously observed data points and storing only their most influential tokens (yellow shaded parts; left side). The resulting core tokenset is then interleaved in the training of new tasks (blue shaded parts; right side), where random input dropout is further employed to make the model amenable to partial inputs.
  • Figure 2: Transformers are not robust to partial inputs. We illustrate the fine-tuning behavior of a ViT-B/16 model on 20 random classes of ImageNet using core tokens. (a) Without countermeasures, performance drops massively because the model is unable to handle partial inputs. (b) Randomly zeroing out tokens of complete data points during training (through input dropout) makes the model amenable to learning from core tokens.
  • Figure 3: Influence of token dropout. We perform continual training (CT) on five distinct sub-tasks. Randomly dropping tokens of novel inputs improves accuracy by 40% compared with a scenario in which the model is unable to handle the partial inputs of core tokensets.
  • Figure 4: Comparing core instances with core tokens. The accuracy of a ViT model on a CT task is shown. A memory buffer with core tokens outperforms traditional coresets across different memory sizes (x-axis), and combining the two into core tokensets provides even more benefit.
  • Figure 5: Comparison of core tokenset selection techniques for sequential image classification. Methods differ in their feature attribution technique, yet all demonstrate improvements over a traditional GradMatch-based coreset.
  • ...and 4 more figures
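
The input dropout mentioned in the captions of Figures 1 to 3 (randomly zeroing out tokens of complete data points during training) could look roughly like the sketch below. It is an assumption-laden illustration rather than the paper's procedure: the function name, the `drop_prob` parameter, and the choice to drop each token independently are hypothetical, and tokens are represented as plain lists of floats to keep the example dependency-free.

```python
import random


def random_token_dropout(tokens, drop_prob, rng):
    """Zero out each token embedding independently with probability drop_prob.

    tokens    -- list of token embeddings, each a list of floats.
    drop_prob -- assumed hyperparameter in [0, 1]; not taken from the paper.
    rng       -- a random.Random instance, passed in for reproducibility.
    Returns a new list with the same shape; the input is left unmodified.
    """
    dropped = []
    for token in tokens:
        if rng.random() < drop_prob:
            # Replace the whole token by zeros so the sequence length
            # and token positions stay unchanged.
            dropped.append([0.0] * len(token))
        else:
            dropped.append(list(token))
    return dropped


# Example: a "complete" input of four 2-dimensional tokens.
full_input = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
partial_input = random_token_dropout(full_input, 0.5, random.Random(0))
```

Applying such a transform only to the complete inputs of the new task exposes the model to partial inputs during training, which is the property the captions credit for making replay of stored core tokens effective.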