FORLA: Federated Object-centric Representation Learning with Slot Attention

Guiqiu Liao, Matjaz Jogan, Eric Eaton, Daniel A. Hashimoto

TL;DR

This work introduces FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention, and highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.

Abstract

Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling domain-specific factors without supervision. We introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. At the core of our method are a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student-teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterpart. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments on multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.
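
To make the role of the slot attention module more concrete, the following is a minimal sketch of the competition step at the heart of generic slot attention: each feature distributes its attention across slots via a softmax over the slot axis, and each slot is then updated to the attention-weighted mean of the features. This is not FORLA's implementation. It omits the learned query/key/value projections, layer normalization, and the GRU/MLP update of the full module, and the function name `slot_attention_step` and the toy inputs are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(
    slots: List[Vector], features: List[Vector], eps: float = 1e-8
) -> Tuple[List[Vector], List[List[float]]]:
    """One simplified slot-attention update (no learned parameters).

    Returns the updated slots and the attention matrix attn[k][n]
    (slot k, feature n).
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    num_slots, num_feats = len(slots), len(features)

    # Softmax over the SLOT axis: slots compete for each feature.
    attn = [[0.0] * num_feats for _ in range(num_slots)]
    for n, feat in enumerate(features):
        logits = [scale * dot(slot, feat) for slot in slots]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        norm = sum(exps)
        for k in range(num_slots):
            attn[k][n] = exps[k] / norm

    # Each slot becomes the attention-weighted mean of the features.
    new_slots = []
    for k in range(num_slots):
        total = sum(attn[k]) + eps
        new_slots.append(
            [
                sum(attn[k][n] * features[n][d] for n in range(num_feats)) / total
                for d in range(dim)
            ]
        )
    return new_slots, attn


if __name__ == "__main__":
    toy_slots = [[1.0, 0.0], [0.0, 1.0]]
    toy_features = [[5.0, 0.0], [0.0, 5.0], [4.0, 0.0]]
    updated, attention = slot_attention_step(toy_slots, toy_features)
    print(updated)
```

Because the softmax runs over slots rather than over features, every feature's attention sums to one across slots, which is what pushes different slots toward different objects.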

Paper Structure

This paper contains 52 sections, 6 equations, 7 figures, 14 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of FORLA. Left: Within each client, student and teacher branches are trained to reconstruct raw foundation model features and adapted features, respectively. Right: During each global federated learning (FL) round, the student's adapter and Slot Attention (SA) modules are aggregated across clients via the server. In later stages of training, the teacher's adapter and SA modules are locally synchronized with the student through a local FedAvg update, enabling progressive knowledge distillation.
  • Figure 2: Visualization of PCA maps and Slot Attention (SA) masks from different methods. The middle three rows show the first three PCA components visualized using RGB channels for frozen foundation model features and adapted features produced by the AFM module trained under centralized training and under FORLA. The last two rows illustrate the scene decomposition ability of each method via the SA-generated masks.
  • Figure 3: mBO and CorLoc for centralized training and FORLA across different data combinations (Data Comb). Individualized training is shown as a baseline.
  • Figure 4: Inference with RNN-like slot initialization [zadaianchuk2024object] on YTOBJ videos. We compare against SA models trained individually on YTOBJ with a single adapted foundation model, including DINO (as used in DINOSAUR [seitzer2022bridging]) and SAM.
  • Figure 5: Additional results on YTOBJ videos, compared to individually trained SA models with single foundation model adaptation.
  • ...and 2 more figures
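
The aggregation described in the Figure 1 caption can be illustrated with a minimal FedAvg sketch. It assumes parameters are stored as plain name-to-list dictionaries and that the local teacher-student synchronization uses equal weights. The names `fedavg` and `sync_teacher_with_student`, the parameter layout, and the weighting are all hypothetical choices for illustration, not details taken from the paper.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def fedavg(client_params: Sequence[Params], client_weights: Sequence[float]) -> Params:
    """Weighted average of per-client parameters (standard FedAvg)."""
    if not client_params or len(client_params) != len(client_weights):
        raise ValueError("need one weight per client and at least one client")
    total = float(sum(client_weights))
    if total <= 0:
        raise ValueError("client weights must sum to a positive value")

    averaged: Params = {}
    for name, reference in client_params[0].items():
        averaged[name] = [
            sum(w * p[name][i] for p, w in zip(client_params, client_weights)) / total
            for i in range(len(reference))
        ]
    return averaged


def sync_teacher_with_student(teacher: Params, student: Params) -> Params:
    """Local two-model average; equal weighting is an assumption of this sketch."""
    return fedavg([teacher, student], [1.0, 1.0])


if __name__ == "__main__":
    # Hypothetical student adapter weights from two clients, weighted by dataset size.
    client_a = {"adapter.w": [1.0, 2.0]}
    client_b = {"adapter.w": [3.0, 6.0]}
    global_student = fedavg([client_a, client_b], [1.0, 3.0])
    print(global_student)
    print(sync_teacher_with_student(client_a, global_student))
```

In a global round, the server would call something like `fedavg` on the student adapter and SA parameters from every client; the later-stage teacher update reuses the same averaging locally between a client's teacher and student.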