Model Zoo: A Growing "Brain" That Learns Continually

Rahul Ramesh, Pratik Chaudhari

TL;DR

The paper addresses the problem of continual learning under task interdependencies and interference. It proposes Model Zoo, a boosting-inspired ensemble approach that grows across episodes by training small models on current and past tasks and averaging predictions for each task, enabling strong forward and backward transfer. The authors develop a theoretical framework around task relatedness and transfer exponents to characterize when joint learning helps and provide bounds on excess risk; they also demonstrate empirically that capacity splitting across tasks can outperform existing methods across a broad benchmark suite. A striking finding is that simple isolated training can outperform many continual-learning methods in some settings, underscoring the importance of evaluating baselines and the potential of capacity-splitting with data replay. Overall, Model Zoo offers a scalable, effective strategy for continual learning by dynamically expanding learning capacity and leveraging cross-task synergies while mitigating detrimental task competition.

Abstract

This paper argues that continual learning methods can benefit from splitting the capacity of the learner across multiple models. We use statistical learning theory and experimental analysis to show how multiple tasks can interact with each other in a non-trivial fashion when a single model is trained on them. The generalization error on a particular task can improve when it is trained with synergistic tasks, but can also deteriorate when it is trained with competing tasks. This theory motivates our method named Model Zoo which, inspired by the boosting literature, grows an ensemble of small models, each of which is trained during one episode of continual learning. We demonstrate that Model Zoo obtains large gains in accuracy on a variety of continual learning benchmark problems. Code is available at https://github.com/grasp-lyrl/modelzoo_continual.
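
The two mechanics described above (choosing which tasks a new small model is trained on, and averaging predictions over all models that saw a task) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_tasks` and `ensemble_predict`, the dict-of-heads representation of a model, and the softmax-style weighting of training losses are all assumptions made for illustration; the source only states that tasks with high training loss under the current ensemble are selected and that predictions are averaged per task.

```python
import numpy as np


def sample_tasks(task_losses, current_task, b, rng):
    """Return the current task plus up to b-1 past tasks, favoring high loss.

    task_losses maps task id -> training loss of the current ensemble.
    The softmax-style weighting is an assumption made for this sketch.
    """
    past = [t for t in task_losses if t != current_task]
    k = min(b - 1, len(past))
    if k <= 0:
        return [current_task]  # b = 1 means no replay of past tasks
    losses = np.array([task_losses[t] for t in past], dtype=float)
    weights = np.exp(losses - losses.max())  # shift for numerical stability
    probs = weights / weights.sum()
    chosen = rng.choice(len(past), size=k, replace=False, p=probs)
    return [current_task] + [past[i] for i in chosen]


def ensemble_predict(zoo, task, x):
    """Average class probabilities over every model trained on `task`."""
    outputs = [model[task](x) for model in zoo if task in model]
    if not outputs:
        raise KeyError(f"no model in the zoo was trained on task {task!r}")
    return np.mean(outputs, axis=0)


# Toy zoo: each "model" is a dict from task id to a stand-in prediction head.
zoo = [
    {"A": lambda x: np.array([0.8, 0.2])},
    {"A": lambda x: np.array([0.6, 0.4]), "B": lambda x: np.array([0.1, 0.9])},
]
rng = np.random.default_rng(0)
tasks_for_next_model = sample_tasks({"A": 2.0, "B": 0.5, "C": 1.0}, "C", b=2, rng=rng)
avg_probs_A = ensemble_predict(zoo, "A", x=None)  # mean of the two heads for task A
```

In this toy setup, task "A" is covered by two models, so its prediction is the mean of both heads, while task "B" is answered by the single model that was trained on it. In a real system each head would be a trained network rather than a constant function.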

Paper Structure

This paper contains 47 sections, 1 theorem, 17 equations, 14 figures, 6 tables.

Key Result

Theorem 2

Say we wish to find a good hypothesis for task $P_1$ and have access to $n$ tasks $P_1, \ldots, P_n$, where each pair $P_i, P_j$ is $\rho_{ij}$-related. Arrange the tasks in increasing order of $\rho_{i1}$, i.e., their relatedness to $P_1$. Let this ordering be $P_{(1)}, P_{(2)}, \ldots, P_{(n)}$ with [...] where $\rho_{\max}(k) = \max \left\{\rho_{(1)}, \ldots, \rho_{(k)}\right\}$ and $c, c'$ are constants [...]

Figures (14)

  • Figure 1: Left: How well do existing continual learning methods work in the single-epoch setting? We track the average accuracy (over all tasks seen until the current episode) on the Split-miniImagenet dataset. All methods in this plot (unless specified otherwise) are evaluated in the single-epoch setting [lopez2017gradient], i.e., each new task is allowed only 1 epoch of training. We compare our method Model Zoo and its variants (all in bold) to existing continual learning methods designed for the single-epoch setting (faint lines, see \ref{tab:lifelong_main} for references). Isolated refers to a very simplistic realization of Model Zoo where a separate model is fitted at each episode without any continual learning or data sharing between tasks; Isolated-small and Model Zoo-small refer to using a very small deep network with 0.12M weights. A number of surprising findings are seen here. (i) Isolated-small (black) outperforms existing methods by more than 10%, with faster training and inference, a comparable model size, and no data replay. This indicates that existing methods do not sufficiently leverage data from multiple tasks. It also indicates the utility of simple methods like Isolated for a more prosaic, matter-of-fact evaluation of continual learning. (ii) While the larger model with 3.6M weights per round, Isolated-Single Epoch (royal blue), performs poorly, its accuracy is better than that of existing methods when it is trained for multiple epochs (Isolated-Multi Epoch). This indicates that methods may be severely under-trained in the single-epoch setting and that this may not be the appropriate setting in which to build continual learning methods; this was also noticed by [lopez2017gradient]. (iii) Model Zoo and Model Zoo-small, which replay all data from past tasks (A-GEM also replays 10% of the data), achieve around 10% improvement over their Isolated counterparts in both the single-epoch and multi-epoch settings; Model Zoo has an improved ability to solve each task by leveraging other tasks. This indicates that replaying data from past tasks is beneficial [robins1995catastrophic], even if replay may not conform to certain stylistic formulations of continual learning in the literature [farquharrobustevaluationscontinual2019, kaushik2021understanding]. Not doing so significantly hurts forward and backward transfer, and average task accuracy. Right: Does the single-epoch setting show forward-backward transfer? The evolution of individual task accuracy of Model Zoo (the multi-epoch setting in bold and the single-epoch setting in dotted lines) on the Split-miniImagenet dataset (only 5 tasks are plotted here, see \ref{fig:app:task_acc_coarse} for the full version). The X markers denote the accuracy of Isolated. The accuracy of tasks improves with each episode, which indicates backward transfer. Also, the X markers are often below the initial accuracy of the task during continual learning, which indicates forward transfer. While both single-epoch and multi-epoch Model Zoo show good forward-backward transfer, the accuracy of tasks for the former is about 25% worse than for the latter; corresponding plots for other methods are in \ref{s:app:task_acc}. This indicates that we should also pay attention to under-training and per-task accuracy in continual learning.
  • Figure 2: Competition between tasks in continual learning can be non-trivial. In order to demonstrate how some tasks help and some tasks hurt each other, we run a multi-task learner for a varying number of tasks (X-axis) and track the accuracy on a few tasks from CIFAR100 (each task is a superclass). Each cell represents a different experiment, i.e., there is no continual learning being performed here. Cells are colored warm if accuracy is worse than the median accuracy of that row. For instance, multi-task training with 11 tasks is beneficial for "Man-made Outdoor", but accuracy drops drastically upon introducing task #12; it improves upon introducing #14, while task #17 again leads to a drop. One may study the other rows to reach a similar conclusion: there is non-trivial competition between tasks, even in commonly used datasets. As we show, tackling this effectively is the key to obtaining good performance on continual learning problems. See \ref{s:app:moretasks} for a more elaborate version.
  • Figure 3: Ideally, we want to train synergistic tasks together, e.g., Model 1 for $P_1$ using $P_3, P_6$ and Model 3 for $P_3$ using $P_1, P_4, P_5$. At test time, all models (1, 2, 3) that were trained on a particular task, say $P_1$, would make predictions. Model Zoo is a simple, scalable instantiation of this idea. Discovering noncompeting tasks is difficult, so Model Zoo instead selects tasks that have high training loss under the current ensemble.
  • Figure 4: Ablation studies that show the average per-task accuracy as we vary the size of data replay for Model Zoo (left), the number of past tasks sampled at each episode (middle, $\mathscr{b}=1$ implies no replay), and compare Model Zoo with an ensemble of Isolated models (right). These results are for the single-epoch setting and are therefore directly comparable to those in \ref{tab:lifelong_metrics} and \ref{tab:lifelong_main} as far as comparison to other methods is concerned. Accuracy is roughly the same on Split-CIFAR100 across varying degrees of replay while it improves significantly on Split-miniImagenet; this suggests that Model Zoo also works with very small amounts of data replay. Accuracy on Split-CIFAR100 is consistent as the number of replay tasks is changed but increases on larger datasets like Split-miniImagenet where there are many more tasks. Finally, the performance of Model Zoo is not merely an artifact of ensembling. Even if Isolated is a strong model, a very large ensemble of Isolated compares poorly to Model Zoo with 100% replay; this indicates that Model Zoo can effectively leverage data from past tasks without forgetting. See the Appendix for more ablation studies.
  • Figure A1: Pairwise task competition matrix. Cells are colored by the gain (green) or loss (warm) in accuracy of pairwise Multi-Head training as compared to training the row-task in isolation; this is a good proxy for the transfer coefficient $\rho_{ij}$ in \ref{eq:transfer_exponent}. Although most pairs benefit each other (green), certain tasks, e.g., "Food Container", are best trained in isolation while others such as "Aquatic Mammals" are typically detrimental to most other tasks. One can study this matrix and identify many more such properties. In summary, whether tasks aid or hurt each other is quite nuanced even for CIFAR100.
  • ...and 9 more figures
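
The cell-coloring rules described in the captions of Figure 2 and Figure A1 amount to two small array computations, sketched below under stated assumptions. The function names `pairwise_gain` and `below_row_median` are hypothetical, and the accuracy values are made-up numbers chosen for illustration, not results from the paper.

```python
import numpy as np


def pairwise_gain(pair_acc, isolated_acc):
    """gain[i, j] = accuracy on task i when trained with task j,
    minus accuracy on task i when trained in isolation (Figure A1 rule)."""
    pair_acc = np.asarray(pair_acc, dtype=float)
    isolated_acc = np.asarray(isolated_acc, dtype=float)
    return pair_acc - isolated_acc[:, None]  # subtract each row's baseline


def below_row_median(acc):
    """True where a cell is worse than the median of its row (Figure 2 rule)."""
    acc = np.asarray(acc, dtype=float)
    return acc < np.median(acc, axis=1, keepdims=True)


# Made-up accuracies for two tasks (rows = task being evaluated).
pair_acc = [[0.70, 0.75],
            [0.58, 0.60]]
isolated_acc = [0.70, 0.60]
gain = pairwise_gain(pair_acc, isolated_acc)  # positive = "green", negative = "warm"

# Made-up accuracies as the number of jointly trained tasks varies (columns).
acc_vs_num_tasks = [[0.5, 0.6, 0.7],
                    [0.9, 0.8, 0.7]]
warm = below_row_median(acc_vs_num_tasks)
```

A positive entry of `gain` corresponds to a pair where joint training helps the row task, and a negative entry to a pair where it hurts; `warm` marks the experiments that fall below their row's median accuracy.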

Theorems & Definitions (6)

  • Remark 1: Data from other tasks may not improve accuracy even if they are synergistic
  • Theorem 2: Task competition
  • Remark 3: Picking the size of the hypothesis space
  • Remark 4: The set of synergistic tasks can be different for different tasks
  • Remark 5: Continual learning is particularly challenging due to task competition
  • Remark 6: Assumptions in the formulation of Model Zoo