FedCVU: Federated Learning for Cross-View Video Understanding

Shenghan Zhang; Run Ling; Ke Cao; Ao Ma; Zhanjie Zhang

Abstract

Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
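The abstract does not give the exact VS-Norm update rule, so the following is only a minimal sketch of the general idea it describes: the server averages shared parameters while each client keeps its own normalization parameters. The parameter naming rule (`"norm"` in the name), the plain-list parameter format, and the helper names `is_norm_param`, `aggregate_shared`, and `apply_update` are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical parameter format: name -> flat list of floats.
Params = Dict[str, List[float]]


def is_norm_param(name: str) -> bool:
    """Assumed naming rule: parameters with 'norm' in the name are view-specific."""
    return "norm" in name


def aggregate_shared(clients: List[Params], weights: List[float]) -> Params:
    """Weighted average over clients, skipping normalization parameters."""
    total = sum(weights)
    shared: Params = {}
    for name in clients[0]:
        if is_norm_param(name):
            continue  # normalization parameters never leave the client
        size = len(clients[0][name])
        shared[name] = [
            sum(w * c[name][i] for c, w in zip(clients, weights)) / total
            for i in range(size)
        ]
    return shared


def apply_update(local: Params, shared: Params) -> Params:
    """Overwrite shared entries; local normalization parameters stay untouched."""
    merged = {name: list(values) for name, values in local.items()}
    for name, values in shared.items():
        merged[name] = list(values)
    return merged


# Two toy clients (camera views) with equal aggregation weight.
clients = [
    {"block.weight": [1.0, 3.0], "block.norm.scale": [0.5]},
    {"block.weight": [3.0, 5.0], "block.norm.scale": [2.0]},
]
shared = aggregate_shared(clients, [1.0, 1.0])
updated = [apply_update(c, shared) for c in clients]
```

In this toy run both clients end up with the same averaged `block.weight`, while each keeps its own `block.norm.scale`. Skipping normalization parameters also shrinks the upload slightly, although the communication savings claimed in the abstract come from SLA, which this sketch does not attempt to model.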

Paper Structure (20 sections, 13 equations, 2 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Federated Learning
Cross-View Video Understanding
Method
Problem Formulation
Federated Framework: FedCVU
View-Specific Normalization
Cross-View Contrastive Alignment
Selective Layer Aggregation
Experiments
Experimental Setup
Overall Performance Comparison
Ablation Studies
Analysis of Synchronization Frequency
...and 5 more sections

Figures (2)

Figure 1: Convergence curves on MCAD (Top-1) and MARS (mAP). FedCVU converges smoothly within 112–123 rounds and achieves the highest accuracy, while FedAvg/FedProx plateau early at low performance.
Figure 2: Strong synchronization frequency across Transformer blocks on MCAD and MARS. MCAD exhibits a U-shaped pattern where shallow and deep blocks are more frequently synchronized, while mid-level blocks remain mostly localized due to higher view-specific variability. In contrast, MARS shows consistently higher synchronization in deeper blocks, reflecting the stability of identity semantics across cameras.
