Distributed Continual Learning

Long Le; Marcel Hussing; Eric Eaton

TL;DR

Distributed Continual Learning (DCL) studies a network of heterogeneous agents that encounter tasks sequentially and must exchange knowledge under communication budgets and a given network topology. The authors formalize DCL on a directed graph $\mathcal{G}$ with a cumulative objective: minimize the total expected loss across tasks while constraining knowledge transfer through the per-edge quantities $b_{ij}$ and $f_{ij}$ and a global clock $C$. They compare data-instance, full-model, and modular parameter sharing, including a reusable-module approach called $\mathtt{modmod}$. Empirical results across MNIST variants and CIFAR-100 show that modular parameter sharing accelerates early learning and reduces communication, that data sharing yields strong final accuracy on easier tasks, and that combining sharing modes delivers the best overall performance under realistic budgets. The work provides baselines, highlights the hidden costs of communication, and broadens the evaluation framework for DCL with heterogeneous agents, pointing toward extensions to reinforcement learning and more complex transfer strategies.

Abstract

This work studies the intersection of continual and federated learning, in which independent agents face unique tasks in their environments and incrementally develop and share knowledge. We introduce a mathematical framework capturing the essential aspects of distributed continual learning, including agent model and statistical heterogeneity, continual distribution shift, network topology, and communication constraints. Operating on the thesis that distributed continual learning enhances individual agent performance over single-agent learning, we identify three modes of information exchange: data instances, full model parameters, and modular (partial) model parameters. We develop algorithms for each sharing mode and conduct extensive empirical investigations across various datasets, topology structures, and communication limits. Our findings reveal three key insights: sharing parameters is more efficient than sharing data as tasks become more complex; modular parameter sharing yields the best performance while minimizing communication costs; and combining sharing modes can cumulatively improve performance.
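To make the full-model sharing mode concrete, the sketch below shows one round in which every agent averages its own parameters with those it receives over incoming edges of a directed graph. It is an illustration only, not the authors' algorithm: the unweighted average (in the style of federated averaging), the flat-list parameter representation, and the names `average_with_neighbors`, `params`, and `in_neighbors` are all assumptions made for this example.

```python
from typing import Dict, List

Params = List[float]  # hypothetical: a model flattened into a list of floats


def average_with_neighbors(
    params: Dict[int, Params],
    in_neighbors: Dict[int, List[int]],
) -> Dict[int, Params]:
    """One round of full-model sharing on a directed graph.

    Each agent replaces its parameters with the unweighted mean of its own
    parameters and those sent by its in-neighbors (an assumed aggregation rule).
    """
    updated: Dict[int, Params] = {}
    for agent, own in params.items():
        received = [params[j] for j in in_neighbors.get(agent, [])]
        stack = [own] + received
        # Average coordinate-wise across the agent's own and received models.
        updated[agent] = [sum(vals) / len(stack) for vals in zip(*stack)]
    return updated


# Three agents; agent 2 has no incoming edges, so it keeps its own parameters.
params = {0: [0.0, 2.0], 1: [2.0, 4.0], 2: [4.0, 0.0]}
in_neighbors = {0: [1], 1: [0, 2], 2: []}
print(average_with_neighbors(params, in_neighbors))
```

An agent with no incoming edges is left unchanged, which corresponds to the isolated single-agent baseline. Data sharing and modular sharing would exchange training instances or individual modules instead of whole parameter vectors and are not covered by this sketch.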

Paper Structure (29 sections, 1 equation, 5 figures, 3 tables)


Introduction
Related Work
Federated learning
Continual Learning
Multi-agent Systems
Distributed Continual Learning
Distributed Continual Learning Framework
Modes of Knowledge Sharing
Data Sharing
Full Model Parameter Sharing
Modular Parameter Sharing
Experiments
What to share in a monolithic distributed setting?
What to share in a modular distributed setting?
How do communication constraints affect the efficacy of sharing modes?
...and 14 more sections

Figures (5)

Figure 1: Mean test accuracy and standard error of the no-sharing baseline, data sharing, and federated learning with monolithic models. Sharing data is best for easier tasks while federated learning is best for harder tasks (i.e., CIFAR-100). In the heterogeneous $\mathtt{combined}$ dataset, where agents face very different tasks, aggregating models via federated methods is worse than single-agent learning.
Figure 2: Mean test accuracy and standard error of the no-sharing baseline, data sharing, full-model sharing (federated learning), and partial model sharing ($\mathtt{modmod}$) with modular models. $\mathtt{modmod}$ outperforms all other methods in terms of learning speed while sharing data reaches the highest final accuracy in less difficult datasets. Federated methods are less effective in modular models; one exception is in the more difficult CIFAR-100 dataset, where it is better to share parameters than data.
Figure 3: Relative gain in final accuracy versus the log of communication cost, $\log(\mathtt{B})$. Lines linearly fit through each sharing mode show the general trend. The results are averaged across agents, random seeds, and datasets. Qualitatively, $\mathtt{modmod}$ improves the final accuracy over isolated learning while requiring substantially less communication than other sharing modes, as further corroborated by Table \ref{tab:marginal_gains_budget}.
Figure 4: Relative gain in final average accuracy versus topology with modular and monolithic models. We compare four common topologies: Erdős-Rényi graphs with edge probability $p \in \{1, 0.75, 0.5, 0.3, 0.1\}$, ring, server, and tree. Results are averaged across agents, random seeds, and datasets. On the vertical axis, a relative gain of 0% corresponds to isolated single-agent learning with a disconnected topology ($p=0$). The results show that performance degrades as the graph becomes sparser, i.e., for low values of $p$. Ring, server, and tree have similar performance, falling between a fully connected topology ($p=1$) and the no-sharing baseline. In monolithic neural networks, the server is the worst performing of the three.
Figure 5: Mean test accuracy and standard error of a hybrid combination of all sharing modes compared against individual modes in modular (top) and monolithic (bottom) networks. In modular networks, the hybrid mode achieves the best performance in both learning speed and final accuracy. Its advantage is less pronounced in monolithic networks.
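The four topologies compared in Figure 4 can be sketched as edge sets over `n` agents. This is a minimal illustration under stated assumptions, not the paper's experimental code: edges are treated as undirected pairs `(i, j)` with `i < j`, the server topology is assumed to be a star centered on agent 0, the tree is assumed to be a balanced binary tree, and all function names are hypothetical.

```python
import random
from typing import Set, Tuple

Edge = Tuple[int, int]  # undirected edge stored as (smaller id, larger id)


def erdos_renyi(n: int, p: float, seed: int = 0) -> Set[Edge]:
    """Include each of the n*(n-1)/2 possible edges independently with probability p."""
    rng = random.Random(seed)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p}


def ring(n: int) -> Set[Edge]:
    """Connect each agent to its successor, wrapping around at the end."""
    return {tuple(sorted((i, (i + 1) % n))) for i in range(n)}


def server(n: int) -> Set[Edge]:
    """Star graph: agent 0 plays the server (an assumption of this sketch)."""
    return {(0, i) for i in range(1, n)}


def tree(n: int) -> Set[Edge]:
    """Balanced binary tree: agent i's parent is (i - 1) // 2 (assumed shape)."""
    return {((i - 1) // 2, i) for i in range(1, n)}


n = 8
for name, edges in [
    ("fully connected (p=1)", erdos_renyi(n, 1.0)),
    ("disconnected (p=0)", erdos_renyi(n, 0.0)),
    ("ring", ring(n)),
    ("server", server(n)),
    ("tree", tree(n)),
]:
    print(f"{name}: {len(edges)} edges")
```

With `p=1` the Erdős-Rényi generator yields the fully connected graph and with `p=0` it yields the disconnected no-sharing baseline, matching the two endpoints described in the caption. Ring, server, and tree all have roughly `n` edges, which is consistent with the caption placing them between those two extremes.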
