FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models

Hui Lin, Chao Zhang, Danfeng Hong, Kexin Dong, Congcong Wen

TL;DR

FedRSCLIP is a federated learning framework for remote sensing scene classification that builds on a Vision-Language Model (CLIP) and uses Prompt Learning to avoid the high communication cost of exchanging full model weights. It combines a dual-prompt mechanism (shared and private prompts) with two alignment constraints (the Dual Prompt Alignment Constraint and the Cross-Modal Feature Alignment Constraint) to balance global generalization and local adaptability across non-IID client data. Experiments on the Fed-RSIC benchmark (Fed-Optimal, Fed-UCMerced, Fed-NWPU) report state-of-the-art accuracy while transmitting as few as 2,048 parameters, and ablation studies support the contribution of both the prompts and the alignment losses. The work indicates that VLMs can be integrated into federated RS pipelines in practice, laying a foundation for cross-modal, privacy-preserving RS analytics with scalable communication.
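To put the 2,048-parameter figure in perspective, the following back-of-the-envelope calculation compares per-round upload sizes. It is only a sketch under stated assumptions: parameters are sent as uncompressed 32-bit floats, and `full_model_params` is a hypothetical round number standing in for a billion-parameter VLM, not a measured size of CLIP or of any model in the paper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 transmission


def payload_bytes(num_params: int, bytes_per_param: int = BYTES_PER_FLOAT32) -> int:
    """Bytes one client uploads per round if parameters are sent uncompressed."""
    return num_params * bytes_per_param


prompt_params = 2_048              # smallest transmitted-parameter count quoted in the TL;DR
full_model_params = 1_000_000_000  # hypothetical billion-parameter VLM (illustrative only)

prompt_payload = payload_bytes(prompt_params)
full_payload = payload_bytes(full_model_params)

print(prompt_payload)                  # 8192 bytes, i.e. 8 KiB
print(full_payload // prompt_payload)  # 488281, the ratio between the two uploads
```

Under these assumptions a prompt upload is 8 KiB per client per round, roughly five orders of magnitude smaller than shipping the full weights of the hypothetical model.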

Abstract

Remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they would incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework introduces a dual-prompt mechanism, comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results demonstrate the effectiveness and superiority of FedRSCLIP in remote sensing image classification.
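The dual-prompt communication pattern described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Client`, `aggregate_shared_prompts`, and `broadcast` are hypothetical, prompts are flattened to plain lists of floats, and the server is assumed to use a sample-weighted average in the style of FedAvg, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List

Prompt = List[float]  # flattened prompt vector; real prompts are token-embedding matrices


@dataclass
class Client:
    shared_prompt: Prompt   # uploaded to the server every round
    private_prompt: Prompt  # never leaves the client
    num_samples: int        # local dataset size, used as the aggregation weight


def aggregate_shared_prompts(clients: List[Client]) -> Prompt:
    """Sample-weighted average of shared prompts (FedAvg-style; an assumption)."""
    total = sum(c.num_samples for c in clients)
    dim = len(clients[0].shared_prompt)
    return [
        sum(c.shared_prompt[i] * c.num_samples for c in clients) / total
        for i in range(dim)
    ]


def broadcast(clients: List[Client], global_prompt: Prompt) -> None:
    """Overwrite each client's shared prompt; private prompts are left untouched."""
    for c in clients:
        c.shared_prompt = list(global_prompt)


clients = [
    Client(shared_prompt=[1.0, 0.0], private_prompt=[0.5, 0.5], num_samples=100),
    Client(shared_prompt=[0.0, 2.0], private_prompt=[-1.0, 3.0], num_samples=300),
]
global_prompt = aggregate_shared_prompts(clients)
broadcast(clients, global_prompt)
print(global_prompt)  # [0.25, 1.5]
```

Only `shared_prompt` crosses the network in this sketch; the frozen CLIP backbone and each `private_prompt` stay local, which is what keeps the per-round payload small.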

Paper Structure

This paper contains 29 sections, 15 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Illustration of Federated Learning Using Vision-Language Models (VLMs) in Remote Sensing Tasks. Remote sensing data is typically distributed across different institutions and is subject to privacy concerns and data-sharing restrictions. Traditional federated learning uploads each client’s local model parameters to a central server for a unified update, and the updated parameters are then sent back to each client. However, when each client runs a VLM, transmitting billions of parameters in every round incurs heavy communication and bandwidth costs.
  • Figure 2: Illustration of FedRSCLIP's federated learning framework for remote sensing image classification using VLMs across multiple clients. Each client contains its own VLM, processing both image and text prompts. The framework uses a dual-prompt mechanism, with Shared Prompts (Prompt$^S$) for global knowledge sharing across clients, and Private Prompts (Prompt$^P$) tailored to each client's unique data distribution. The Dual Prompt Alignment Constraint (yellow rectangle with black dashed border) ensures alignment between shared and private prompts, while the Cross-Modal Feature Alignment Constraint (cyan rectangle with black dashed border) aligns textual and image features within each client to capture meaningful multimodal information. The server performs global updates by aggregating the shared prompts (P$_S$) from all clients to improve the model's generalization across heterogeneous data environments.
  • Figure 3: Classification results of FedRSCLIP across 10 clients on the Fed-Optimal dataset. Six representative classes are randomly selected for visualization: Airplane, Desert, Forest, Harbor, Parking Lot, and Roundabout. Each column corresponds to a different client, and the rows represent the predicted class labels. Correct classifications are shown in their respective rows, while misclassified samples are highlighted with green borders.
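The Figure 2 caption names two alignment constraints but does not give their mathematical form. Purely as an illustration of how such terms could enter a client's local objective, the sketch below assumes both are cosine-distance penalties added to the classification loss. The function names, the weights `lam_prompt` and `lam_modal`, and the choice of cosine distance are all assumptions made here, not the paper's equations.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 when the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def local_objective(
    cls_loss: float,
    shared_prompt: Sequence[float],
    private_prompt: Sequence[float],
    text_feat: Sequence[float],
    image_feat: Sequence[float],
    lam_prompt: float = 0.1,  # hypothetical weight for the prompt-alignment term
    lam_modal: float = 0.1,   # hypothetical weight for the cross-modal term
) -> float:
    """Classification loss plus two assumed alignment penalties."""
    dual_prompt_term = cosine_distance(shared_prompt, private_prompt)
    cross_modal_term = cosine_distance(text_feat, image_feat)
    return cls_loss + lam_prompt * dual_prompt_term + lam_modal * cross_modal_term


# Orthogonal prompts are penalised; perfectly aligned text/image features are not.
loss = local_objective(
    cls_loss=0.5,
    shared_prompt=[1.0, 0.0],
    private_prompt=[0.0, 1.0],
    text_feat=[1.0, 0.0],
    image_feat=[1.0, 0.0],
)
print(loss)
```

In this toy setting the prompt term contributes its full weight because the two prompts are orthogonal, while the cross-modal term vanishes because the text and image features coincide.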