Task agnostic continual learning with Pairwise layer architecture

Santtu Keskinen

TL;DR

Problem: catastrophic forgetting in sequential learning without task boundaries. Approach: a static architecture featuring a Pairwise Interaction Layer built on sparse $k$-WTA activations, plus a streaming per-parameter importance mechanism that adapts learning rates via $1/\sqrt{I_i}$. Contributions: introduction of the PW-layer, evaluation of Adagrad and S-MAS for online task-agnostic learning, and demonstration of competitive performance on Split MNIST, Permuted MNIST, and Split Fashion-MNIST compared to rehearsal-free baselines. Impact: shows that architectural design and online importance-based updates can enable rehearsal-free continual learning without explicit task labels, with public code for reproducibility and potential scalability to larger settings.
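
The per-parameter learning-rate scaling mentioned above can be illustrated with the standard Adagrad rule, where the importance of a parameter is the running sum of its squared gradients. This is a minimal sketch under that assumption, not the paper's implementation: the function name `adagrad_step`, the values of `base_lr` and `eps`, and the toy gradients are all hypothetical, and the S-MAS importance estimate is not reproduced here.

```python
import math

def adagrad_step(params, grads, importance, base_lr=0.1, eps=1e-8):
    """One Adagrad-style update: each step is scaled by 1/sqrt(I_i)."""
    new_params, new_importance = [], []
    for p, g, imp in zip(params, grads, importance):
        # Streaming importance: running sum of squared gradients (assumption).
        imp = imp + g * g
        # Parameters with large accumulated importance move less.
        step = base_lr * g / (math.sqrt(imp) + eps)
        new_params.append(p - step)
        new_importance.append(imp)
    return new_params, new_importance

# Toy stream of two gradient observations (hypothetical values).
params = [0.5, -0.2]
importance = [0.0, 0.0]
for grads in ([1.0, 0.0], [1.0, 0.1]):
    params, importance = adagrad_step(params, grads, importance)
```

Because the importance only grows, a parameter that was heavily updated on earlier data takes smaller steps later, which is the property that makes this kind of scaling usable without task boundaries.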

Abstract

Most of the dominant approaches to continual learning are based on memory replay, parameter isolation, or regularization techniques that require task boundaries to calculate task statistics. We propose a static architecture-based method that uses none of these. We show that continual learning performance improves when the final layer of a network is replaced with our pairwise interaction layer. The pairwise interaction layer uses sparse representations from a winner-take-all style activation function to find the relevant correlations in the hidden layer representations. Networks using this architecture show competitive performance in MNIST and FashionMNIST-based continual image classification experiments. We demonstrate this in an online streaming continual learning setup where the learning system cannot access task labels or boundaries.

Paper Structure (22 sections, 3 figures, 5 tables, 1 algorithm)

Figures (3)

  • Figure 1: Illustration of (a) a normal fully connected layer with 4 inputs and 3 outputs, (b) a fully connected pairwise layer with 4 inputs, 6 expanded pairwise feature cross nodes and 3 outputs, and (c) a sparse pairwise interaction layer with just 3 trainable weights. Solid lines represent trainable weights. Each feature cross node multiplies 2 of the inputs together (illustrated by the dashed lines). The grey feature cross nodes are not connected to any outputs and can be pruned to save compute.
  • Figure 2: (A) Overall accuracy on all the tasks learned so far in Permuted MNIST with the small MLP architectures (0.8M parameters). WTA Adagrad and Pairwise Adagrad accuracies overlap almost exactly for most of the training. (B) Accuracies on the first, fourth, and eighth tasks for WTA Adagrad and Pairwise Adagrad.
  • Figure 3: Split MNIST with the small MLP architecture (0.8M parameters) with different values for hidden layer sparsity.
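
The layer structure described in the Figure 1 caption (4 inputs, 6 feature cross nodes, 3 outputs) can be sketched in plain Python. This is an illustrative sketch only: the helper names `k_wta`, `pairwise_features`, and `pairwise_layer`, the choice of `k=2`, and the weight values are assumptions made for the example, not details taken from the paper.

```python
from itertools import combinations

def k_wta(values, k):
    """Keep the k largest activations and zero the rest (ties go to the lower index)."""
    winners = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    keep = set(winners)
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def pairwise_features(values):
    """One feature cross node per unordered input pair: x_i * x_j for i < j."""
    return [values[i] * values[j] for i, j in combinations(range(len(values)), 2)]

def pairwise_layer(values, weights):
    """Linear readout over the feature crosses; weights[o][c] links cross c to output o."""
    crosses = pairwise_features(values)
    return [sum(w * c for w, c in zip(row, crosses)) for row in weights]

hidden = [0.9, 0.1, 0.7, 0.3]             # hypothetical hidden activations
sparse = k_wta(hidden, k=2)               # only the two largest survive
crosses = pairwise_features(sparse)       # 4 inputs give 6 pairwise crosses
weights = [[1.0] * 6, [0.0] * 6, [0.5] * 6]  # hypothetical 3 x 6 readout weights
outputs = pairwise_layer(sparse, weights)
```

With sparse inputs most feature crosses are zero, so only the crosses between co-active units contribute to the outputs. A sparse variant like panel (c) would correspond to keeping only a few nonzero entries in `weights`.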