Tuned Compositional Feature Replays for Efficient Stream Learning

Morgan B. Talbot, Rushikesh Zawar, Rohil Badkundri, Mengmi Zhang, Gabriel Kreiman

TL;DR

The paper tackles online stream learning, where models must continually learn from temporally coherent, non-repeating data without revisiting past samples. It introduces CRUMB, a differentiable codebook of memory blocks that compositionally reconstructs feature maps for memory-efficient replay, approaching offline upper bounds on some datasets with only $3.6\%$ of the memory footprint of raw-image replay. CRUMB's pretraining induces a shape bias that stabilizes learning and reduces forgetting, and its replay operates at the feature level, yielding significant memory and runtime savings across seven datasets, two of which are newly adapted for stream learning. The approach outperforms state-of-the-art baselines in most class-i.i.d. and class-instance settings, scales to large datasets, and is adaptable across CNN architectures, making it well-suited for edge devices and robotic learning scenarios without substantial data-storage overhead.

Abstract

Our brains extract durable, generalizable knowledge from transient experiences of the world. Artificial neural networks come nowhere close to this ability. When tasked with learning to classify objects by training on non-repeating video frames in temporal order (online stream learning), models that learn well from shuffled datasets catastrophically forget old knowledge upon learning new stimuli. We propose a new continual learning algorithm, Compositional Replay Using Memory Blocks (CRUMB), which mitigates forgetting by replaying feature maps reconstructed by combining generic parts. CRUMB concatenates trainable and re-usable "memory block" vectors to compositionally reconstruct feature map tensors in convolutional neural networks. Storing the indices of memory blocks used to reconstruct new stimuli enables memories of the stimuli to be replayed during later tasks. This reconstruction mechanism also primes the neural network to minimize catastrophic forgetting by biasing it towards attending to information about object shapes more than information about image textures, and stabilizes the network during stream learning by providing a shared feature-level basis for all training examples. These properties allow CRUMB to outperform an otherwise identical algorithm that stores and replays raw images, while occupying only 3.6% as much memory. We stress-tested CRUMB alongside 13 competing methods on 7 challenging datasets. To address the limited number of existing online stream learning datasets, we introduce 2 new benchmarks by adapting existing datasets for stream learning. With only 3.7-4.1% as much memory and 15-43% as much runtime, CRUMB mitigates catastrophic forgetting more effectively than the state-of-the-art. Our code is available at https://github.com/MorganBDT/crumb.git.
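
The reconstruction and index-storage idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see their repository for that). It assumes that the feature vector at each spatial location is split into consecutive chunks whose length equals the memory block length, and that each chunk is replaced by its most cosine-similar block. The function names `encode` and `decode`, the tensor shapes, and the block length of 8 are hypothetical; only the codebook size of 256 comes from the Figure 5 caption.

```python
import numpy as np

def encode(Z, B, eps=1e-8):
    """Pick, for every chunk of Z, the index of the most cosine-similar block.

    Z: feature map of shape (C, H, W).
    B: codebook of shape (K, D); each row is one memory block. Assumes C % D == 0.
    Returns an integer array of shape (H, W, C // D).
    """
    C, H, W = Z.shape
    K, D = B.shape
    assert C % D == 0, "channel count must be a multiple of the block length"
    # One row per length-D chunk of each spatial feature vector.
    chunks = Z.transpose(1, 2, 0).reshape(-1, D)
    chunks_n = chunks / (np.linalg.norm(chunks, axis=1, keepdims=True) + eps)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + eps)
    sim = chunks_n @ B_n.T            # cosine similarities, shape (n_chunks, K)
    return sim.argmax(axis=1).reshape(H, W, C // D)

def decode(idx, B):
    """Rebuild a feature map by concatenating the blocks named in idx."""
    H, W, n = idx.shape
    D = B.shape[1]
    Z_tilde = B[idx]                  # gather blocks, shape (H, W, n, D)
    return Z_tilde.reshape(H, W, n * D).transpose(2, 0, 1)

# Toy usage with random data (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
B = rng.normal(size=(256, 8))         # 256 memory blocks of length 8
Z = rng.normal(size=(16, 4, 4))       # stand-in for a CNN feature map
idx = encode(Z, B).astype(np.uint8)   # 256 blocks fit in one byte per index
replay_buffer = [(idx, 3)]            # store indices plus a class label
Z_tilde = decode(replay_buffer[0][0], B)  # reconstruct later for replay
```

In this sketch only the small integer index array and the label are stored per example, and the reconstruction is always expressed in terms of the shared codebook rows, which is the property the abstract credits for the memory savings.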

Paper Structure

This paper contains 39 sections, 4 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Schematic of online stream learning protocols. For each task, the model learns to classify a set of new classes (C1, C2, etc. in figure) while training on video clips of several objects from each class (O1, O2) for only one epoch. During testing, the model has to classify images from all seen classes without knowing task identity. In the class-instance training protocol, the order of video clips is shuffled but the order of frame images is preserved within each clip. In the class-i.i.d. training protocol, all images within each task are randomly shuffled. Class-i.i.d. is the only option for datasets such as ImageNet that consist of standalone images and not video clips.
  • Figure 2: Schematic illustration of CRUMB, the proposed algorithm for online stream learning. The model consists of a CNN ($\mathbf{F(\cdot)}$ for early layers and $\mathbf{P(\cdot)}$ for later layers) and a codebook matrix $\mathbf{B}$ used for compositional reconstruction of feature-level activation tensors (feature maps $\mathbf{Z}$). Each row $\mathbf{B_k}$ of $\mathbf{B}$ is a "memory block" vector. CRUMB uses the feature extractor $\mathbf{F(\cdot)}$ to produce an initial feature map, then determines which memory blocks to retrieve from $\mathbf{B}$ based on a cosine-similarity addressing mechanism. The feature maps reconstructed from the memory blocks ($\mathbf{\widetilde{Z}}$), and the original feature maps ($\mathbf{Z}$), are used to obtain separate classification losses from the same classifier network $\mathbf{P(\cdot)}$ ("codebook-out loss" and "direct loss", respectively). Only codebook-out loss is used for weight updates during stream learning, although the two losses are added in a weighted sum to calculate the total loss during pretraining. To avoid catastrophic forgetting, we store the row indices of retrieved memory blocks along with class labels for example images from each task. In later tasks, following each batch of new images, we "replay" a batch of old feature maps to $\mathbf{P(\cdot)}$ after reconstructing them using stored memory block indices.
  • Figure 3: CRUMB outperforms most baseline algorithms and approaches the upper bound on some datasets. Line plots show top-1 accuracy in online stream learning on video datasets (a) CORe50 (b) Toybox (c) iLab (d) iLab + CORe50 (e) iCub in the class-instance training protocol (class-i.i.d. plots are in supplementary Fig. S1), as well as image datasets (g) Online-CIFAR100 and (h) Online-ImageNet (class-i.i.d.). All models train on the first task for many epochs, but view each image only once on all subsequent tasks. Accuracy estimates are the mean from 10 runs (5 runs for ImageNet), where each run has different class and image/video clip orderings. Error bars show the root-mean-square error (RMSE) among runs. Results for all baselines are in Table \ref{tab:all_results}.
  • Figure 4: CRUMB pretraining induces a bias towards shape information that often persists through stream learning. The height of each bar shows how much smaller (or larger, if negative) CRUMB's drop in normalized test set accuracy under a perturbation is, in comparison to a control network (see section \ref{sec:pretraining_primes}). "Spatial perturbation" shuffles the spatial positions of all feature vectors in an intermediate feature map (at the same layer where it is reconstructed by CRUMB), "feature perturbation" randomly sets half of the feature map's features to zero, and "style perturbation" uses images from Stylized-ImageNet (Geirhos et al., 2018). Streaming results (to the right of grey dotted line) are in the class-instance setting for the video datasets and class-i.i.d. for CIFAR100 and ImageNet. Error bars are standard errors of the mean of relative accuracy advantage among 5 (CIFAR100 and ImageNet) or 10 (other datasets) independent runs. * denotes a statistically significant difference from 0, as determined by a Wilcoxon signed-rank test (see supplementary Section S4.B).
  • Figure 5: Some memory blocks appear to have semantic interpretations. Panel a shows images of "remote controls" and "cans" in the CORe50 test set, showing all-or-none activation of specific memory blocks at corresponding image locations. Of the 256 memory blocks in the codebook, blocks with indices 32 and 48 (blue squares) both similarly respond to greyish background regions, but not bright white or other backgrounds. Blocks 201 and 205 (red) both respond to buttons on remote controls and features of drink cans, while block 197 (yellow) responds only to can features. Similar blocks are aggregated by color (for blue and red) to produce a clearer visualization. Panel b shows the sorted usage frequencies in the CORe50 test set of each of the 256 memory blocks. Colored arrows show the blocks visualized in panel a. The upward black arrow shows the most-used block with frequency 4.4e-5.
  • ...and 3 more figures
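
The two training protocols in the Figure 1 caption differ only in how the frames within a task are ordered. A minimal Python sketch of that difference follows, assuming each task is given as a list of video clips and each clip is a list of frames in temporal order. The function names and the `(class, object, time)` frame tuples are hypothetical and chosen only for illustration.

```python
import random

def class_instance_order(clips, rng):
    """Shuffle the order of clips, but keep frames in order within each clip."""
    clips = list(clips)               # copy so the caller's list is untouched
    rng.shuffle(clips)
    return [frame for clip in clips for frame in clip]

def class_iid_order(clips, rng):
    """Pool every frame in the task and shuffle them all together."""
    frames = [frame for clip in clips for frame in clip]
    rng.shuffle(frames)
    return frames

# One toy task: classes C1 and C2, objects O1 and O2, three frames per clip.
rng = random.Random(0)
task_clips = [
    [(c, o, t) for t in range(3)]
    for c in ("C1", "C2")
    for o in ("O1", "O2")
]
instance_stream = class_instance_order(task_clips, rng)
iid_stream = class_iid_order(task_clips, rng)
```

In either case the resulting stream would be consumed once, in a single pass, which matches the one-epoch constraint described in the caption. For datasets of standalone images, each "clip" would hold a single frame, so only the class-i.i.d. ordering is meaningful.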