Masked Autoencoders are Efficient Continual Federated Learners

Subarnaduti Paul, Lars-Joel Frey, Roshni Kamath, Kristian Kersting, Martin Mundt

TL;DR

The paper addresses unsupervised continual learning in a federated setting where client data drift and task changes occur over time. It proposes CONFEDMADE, a framework that combines masked autoencoders with FedWeIT-style parameter decomposition to enable selective, memory-efficient knowledge transfer across clients. Empirical results on image and binary datasets show reduced forgetting and substantial communication savings, with attention analyses illustrating when shared knowledge aids client learning. This approach advances practical distributed representation learning under non-stationary, privacy-preserving constraints and points toward scalable extensions to transformer-based architectures for multi-modal data.

Abstract

Machine learning is typically framed from a perspective of i.i.d. and, more importantly, isolated data. In part, federated learning lifts this assumption, as it sets out to solve the real-world challenge of collaboratively learning a shared model from data distributed across clients. However, motivated primarily by privacy and computational constraints, the fact that data may change, distributions drift, or even tasks advance individually on clients is seldom taken into account. The field of continual learning addresses this separate challenge, and first steps have recently been taken to leverage synergies in distributed supervised settings, in which several clients learn to solve changing classification tasks over time without forgetting previously seen ones. Motivated by these prior works, we posit that such federated continual learning should be grounded in unsupervised learning of representations that are shared across clients, in the loose spirit of how humans can indirectly leverage others' experience without exposure to a specific task. For this purpose, we demonstrate that masked autoencoders for distribution estimation are particularly amenable to this setup. Specifically, their masking strategy can be seamlessly integrated with task attention mechanisms to enable selective knowledge transfer between clients. We empirically corroborate the latter statement through several continual federated scenarios on both image and binary datasets.

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Set-up schematic. As data (on clients) may drift individually over time, models need to mitigate forgetting. In distributed scenarios, being additionally informed through indirect experience (dashed red arrows) provides further learning benefit, as other clients may observe similar data at different points in time.
  • Figure 2: Proposed CONFEDMADE framework. Left: a masked autoencoder architecture based on the MADE masking rule in eq. 4. Right: the central server initializes the global model and distributes it to all participating clients along with the masking variables $M^W$ and the federated mask $m_c$. In the client module, the masked model parameters ($W_G * M^W * m_c$) correspond to the local client network (left) and are additively decomposed into task-specific parameters $A_c$ and local-base parameters $B_c$ to obtain the final set of trainable parameters. After each round of training $r$, clients communicate their local-base parameters to the server (dotted lines), whereas they communicate the task-adaptive parameters only after completing $r$ rounds of training (solid lines). The central server computes the weighted parameter average via Fed-Avg and communicates it back to the clients at the beginning of each round, whereas it stores $A_c$ in the knowledge base and communicates it only at the beginning of each new task $t$.
  • Figure 3: Decomposed negative log-likelihood in CFL, showing: (left) the average over all tasks seen so far, (center) the "base" task loss, i.e., the value for only the initial task as it evolves over time, to assess forgetting, (right) the "new" task loss, i.e., the value for only the newest task, to gauge encoding of new knowledge. Lower values are better.
  • Figure 4: Heatmaps of the values of $\alpha$ (range 0 to 1), highlighting inter-client knowledge transfer when other clients have observed the same tasks (left) or have partial overlap (right). The two tasks of each client are denoted in brackets on the x- and y-axes, respectively.
  • Figure 5: Decomposed negative log-likelihood in FCL for binary datasets, showing: (left) the average over all tasks seen so far, (center) the "base" loss, i.e., the value for only the initial task as it evolves over time, to assess forgetting, (right) the "new" loss, i.e., the value for only the newest task, to gauge encoding of new knowledge. Lower values are better.
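
The parameter flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The mask construction follows the standard MADE rule for a single hidden layer. The way the client weights are composed (including the attention-weighted sum over other clients' task-adaptive parameters and the final re-masking) is an assumption modelled on the FedWeIT-style decomposition named in the TL;DR. All function and argument names (`made_masks`, `client_weights`, `fed_avg`, `others`, `alphas`) are hypothetical.

```python
import numpy as np


def made_masks(n_inputs, n_hidden, rng):
    """Standard MADE masks for one hidden layer (n_inputs must be >= 2)."""
    # Each hidden unit k is assigned a degree m(k) in {1, ..., D-1}.
    degrees = rng.integers(1, n_inputs, size=n_hidden)
    input_ids = np.arange(1, n_inputs + 1)
    # Hidden unit k may connect to input d only if m(k) >= d.
    mask_w = (degrees[:, None] >= input_ids[None, :]).astype(float)  # (H, D)
    # Output d may connect to hidden unit k only if d > m(k).
    mask_v = (input_ids[:, None] > degrees[None, :]).astype(float)   # (D, H)
    return mask_w, mask_v


def client_weights(w_global, mask_w, m_c, a_c, others=(), alphas=()):
    """Compose one client's first-layer weights (assumed FedWeIT-like form)."""
    # Local-base part: global weights gated by the MADE mask and client mask.
    b_c = w_global * mask_w * m_c
    # Add the client's own task-adaptive parameters.
    theta = b_c + a_c
    # Assumption: other clients' task-adaptive parameters enter through
    # attention weights alpha in [0, 1], as in FedWeIT.
    for alpha, a_j in zip(alphas, others):
        theta = theta + alpha * a_j
    # Assumption: re-apply the MADE mask so that additive terms cannot
    # violate the autoregressive ordering.
    return theta * mask_w


def fed_avg(client_params, n_samples):
    """Federated averaging: mean of parameters weighted by sample counts."""
    weights = np.asarray(n_samples, dtype=float)
    weights = weights / weights.sum()
    return sum(w * p for w, p in zip(weights, client_params))
```

A quick sanity check for the masks is that `mask_v @ mask_w` is strictly lower triangular, meaning output d depends only on inputs with a smaller index. In this sketch the server would call `fed_avg` on the clients' local-base parameters once per round, while the task-adaptive parameters would be exchanged only at task boundaries, as the caption describes.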