FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation

Ding-Ruei Shen

TL;DR

FFREEDG poses semantic segmentation under strict privacy constraints as a source-free federated task, which FRIEREN addresses with vision–language priors from CLIP. FRIEREN first pretrains centrally on labeled source data and then adapts in a federated phase where clients contribute only unlabeled data, using weak-to-strong consistency, dense CLIP distillation, and a language-guided decoder. The method performs competitively against domain generalization and domain adaptation baselines on Cityscapes→ACDC and GTA5→Cityscapes, with FedSWA providing stability in unsupervised federation. The work advances practical privacy-preserving segmentation by showing how foundation-model priors and unified semi-/unsupervised learning can generalize to unseen domains without target labels and without re-accessing the source data during federation. It lays a foundation for future integration of larger vision-language models and alternative decoders to further close the gap to state-of-the-art DG/DA methods.

Abstract

Federated Learning (FL) offers a privacy-preserving way to adapt Semantic Segmentation (SS) models to new domains, but it faces significant challenges from the resulting domain shifts, particularly when client data is unlabeled. Most existing FL methods either unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server's labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.

Paper Structure

This paper contains 49 sections, 14 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Federated source-Free Domain Generalization (FFREEDG) framework. A central server coordinates the training of a global model with multiple clients. Each client has access only to its own private, unlabeled data, which comes from a distinct domain, such as different weather or lighting conditions. The clients perform local, unsupervised training before the server aggregates their updates to create a single model that can generalize across these diverse environments.
  • Figure 2: Overview of FRIEREN. Mask denotes a dropout mechanism (e.g., complementary dropout [yang2025unimatch]) that generates perturbed encoder features for consistency learning. (a) Server pretraining: a student–teacher setup with EMA updates on the teacher, supervised loss on labeled source images, weak–strong consistency on unlabeled images, and dense CLIP-based segmentation guidance via the language-guided decoder. Learnable modules and frozen modules are indicated in the figure. (b) Federated local training: after pretraining, each selected client receives the global model and performs (semi-)unsupervised self-training with a frozen teacher, consistency loss, and dense CLIP guidance; labeled client data (if any) uses an additional supervised term. (c) Communication and aggregation: the server broadcasts $w^{t}$, collects client updates $w_{k}^{t+1}$, and aggregates them; we use FedSWA [liu2025fedswa] for unsupervised clients and FedAvg [mcmahan2017communication] when supervision is available.
  • Figure 3: Vision–language decoder in FRIEREN. FRIEREN adopts the language-guided decoder from SemiVL [hoyer2024semivl]. Class prompts are embedded by a frozen text encoder and paired with features from a trainable image encoder to form a dense similarity map. The decoder performs spatial reasoning followed by semantic reasoning, and a convolutional upsampler produces the final segmentation mask.
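
Two steps named in the captions above can be illustrated with a minimal sketch: the FedAvg aggregation of client updates $w_{k}^{t+1}$ (Figure 2c) and the dense similarity map between per-pixel image features and class text embeddings (Figure 3). This is not the authors' implementation. The function names (fedavg, dense_similarity), the dict-of-lists weight format, and the choice of cosine similarity are assumptions made for illustration; FedSWA is not sketched because its details are not given in this section.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical weight format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fedavg(client_weights: Sequence[Weights], num_samples: Sequence[int]) -> Weights:
    """Sample-weighted average of client parameters (standard FedAvg)."""
    if not client_weights or len(client_weights) != len(num_samples):
        raise ValueError("need one sample count per client")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    aggregated: Weights = {}
    for name in client_weights[0]:
        size = len(client_weights[0][name])
        # Each entry is sum_k (n_k / n) * w_k, computed elementwise.
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(size)
        ]
    return aggregated


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dense_similarity(
    pixel_features: Sequence[Sequence[Sequence[float]]],
    text_embeddings: Sequence[Sequence[float]],
) -> List[List[List[float]]]:
    """Map an H x W grid of D-dim features to an H x W x C similarity map.

    Assumption: similarity is cosine, one channel per class prompt embedding.
    """
    return [
        [[cosine(feature, text) for text in text_embeddings] for feature in row]
        for row in pixel_features
    ]


# Usage: two clients holding 1 and 3 samples, and a 1 x 2 feature grid
# compared against two class embeddings.
global_weights = fedavg([{"a": [1.0, 2.0]}, {"a": [3.0, 6.0]}], [1, 3])
similarity_map = dense_similarity(
    [[[1.0, 0.0], [0.0, 2.0]]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

In this sketch the client with more samples pulls the average toward its parameters, and each pixel's similarity vector peaks at the class whose embedding points in the same direction as its feature. A real decoder would operate on learned tensors rather than nested lists.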