FedCVU: Federated Learning for Cross-View Video Understanding

Shenghan Zhang, Run Ling, Ke Cao, Ao Ma, Zhanjie Zhang

Abstract

Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
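
The abstract describes VS-Norm only as preserving normalization parameters to handle view-specific statistics. As a rough illustration of that idea, and not the authors' implementation, the sketch below averages every parameter across clients except normalization parameters, which each client keeps locally. The naming rule in `is_norm_param`, the plain-list parameter format, and the helper names `aggregate_shared` and `apply_update` are all assumptions made for this example.

```python
from typing import Dict, List

# A model state is represented here as name -> flat list of floats (an assumption
# made to keep the sketch dependency-free).
State = Dict[str, List[float]]


def is_norm_param(name: str) -> bool:
    # Hypothetical convention: normalization layers have "norm" in their name.
    return "norm" in name


def aggregate_shared(client_states: List[State], weights: List[float]) -> State:
    """Weighted average of all non-normalization parameters across clients."""
    total = sum(weights)
    shared: State = {}
    for name in client_states[0]:
        if is_norm_param(name):
            continue  # normalization parameters are never sent or averaged
        size = len(client_states[0][name])
        shared[name] = [
            sum(w * state[name][i] for state, w in zip(client_states, weights)) / total
            for i in range(size)
        ]
    return shared


def apply_update(local_state: State, shared: State) -> State:
    """Overwrite shared parameters; keep the client's own normalization parameters."""
    merged = dict(local_state)
    merged.update(shared)
    return merged


# Two toy clients (camera views) with different normalization statistics.
clients = [
    {"block0.attn.weight": [1.0, 3.0], "block0.norm.weight": [0.5]},
    {"block0.attn.weight": [3.0, 5.0], "block0.norm.weight": [2.0]},
]
shared = aggregate_shared(clients, weights=[1.0, 1.0])
updated = [apply_update(state, shared) for state in clients]
```

After one round, both toy clients hold the same averaged attention weights while each retains its own normalization weight. Because the normalization entries are skipped during aggregation, they also do not need to be transmitted.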

Paper Structure (20 sections, 13 equations, 2 figures, 3 tables, 1 algorithm)

Figures (2)

  • Figure 1: Convergence curves on MCAD (Top-1) and MARS (mAP). FedCVU converges smoothly within 112–123 rounds and achieves the highest accuracy, while FedAvg and FedProx plateau early at lower performance.
  • Figure 2: Strong synchronization frequency across Transformer blocks on MCAD and MARS. MCAD exhibits a U-shaped pattern where shallow and deep blocks are more frequently synchronized, while mid-level blocks remain mostly localized due to higher view-specific variability. In contrast, MARS shows consistently higher synchronization in deeper blocks, reflecting the stability of identity semantics across cameras.