FedRSU: Federated Learning for Scene Flow Estimation on Roadside Units

Shaoheng Fang, Rui Ye, Wenhao Wang, Zuhong Liu, Yuxiao Wang, Yafei Wang, Siheng Chen, Yanfeng Wang

TL;DR

FedRSU presents a federated, self-supervised, multi-modal framework for clutter-free scene flow estimation on roadside units (RSUs). By combining recurrent self-supervision with federated learning (FL), RSUs collaboratively train a LiDAR-camera scene flow model without sharing raw data, using optical-flow-guided multi-modal losses. The authors introduce RSU-SF, a large real-world dataset spanning 17 RSUs, to benchmark FL in intelligent transportation systems (ITS), and demonstrate gains over local training and several FL baselines in both generalized and personalized settings. They also discuss practical deployment considerations, including modality trade-offs, and outline future research directions on data heterogeneity and vehicle-data integration, positioning FedRSU as a scalable pathway for improving ITS perception.

Abstract

Roadside units (RSUs) can significantly improve the safety and robustness of autonomous vehicles through Vehicle-to-Everything (V2X) communication. Currently, a single RSU is used mainly for real-time inference and V2X collaboration, while the potential value of the high-quality data collected by RSU sensors is neglected. Integrating the vast amounts of data from numerous RSUs can provide a rich source of data for model training. However, the absence of ground truth annotations and the difficulty of transmitting enormous volumes of data are two inevitable barriers to fully exploiting this hidden value. In this paper, we introduce FedRSU, an innovative federated learning framework for self-supervised scene flow estimation. In FedRSU, we present a recurrent self-supervision training paradigm in which, for each RSU, the scene flow prediction of points at every timestamp can be supervised by the subsequent multi-modality observation. Another key component of FedRSU is federated learning (FL), where multiple devices collaboratively train an ML model while keeping the training data local and private. With the power of the recurrent self-supervised learning paradigm, FL is able to leverage the vast amounts of underutilized data from RSUs. To verify the FedRSU framework, we construct a large-scale multi-modality dataset, RSU-SF. The dataset consists of 17 RSU clients, covering various scenarios, modalities, and sensor settings. Based on RSU-SF, we show that FedRSU can greatly improve model performance in intelligent transportation systems (ITS) and provide a comprehensive benchmark under diverse FL scenarios. To the best of our knowledge, we provide the first real-world LiDAR-camera multi-modal dataset and benchmark for the FL community.
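
To make the self-supervision idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the loss is a symmetric Chamfer distance between the flow-warped points at time t and the observed points at time t+1, plus a k-nearest-neighbour smoothness term. The function names, the `smooth_weight` and `k` parameters, and the brute-force neighbour search are all illustrative assumptions, and the multi-modal (camera) term is omitted.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Mean nearest-neighbour distance in both directions.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


def smoothness(points: np.ndarray, flow: np.ndarray, k: int = 4) -> float:
    """Mean squared difference between each point's flow and its k neighbours' flow.

    Assumes distinct points and k < N (hypothetical simplification).
    """
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    # Column 0 of the sorted indices is the point itself, so skip it.
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]
    diff = flow[:, None, :] - flow[idx]  # shape (N, k, 3)
    return float((diff ** 2).sum(axis=-1).mean())


def self_supervised_loss(points_t, flow_pred, points_t1, smooth_weight=1.0, k=4):
    """Supervise the flow predicted at time t with the next observed frame."""
    warped = points_t + flow_pred  # move each point by its predicted flow
    return chamfer_distance(warped, points_t1) + smooth_weight * smoothness(
        points_t, flow_pred, k
    )
```

Because the only target is the next frame of sensor data, a loss of this kind needs no manual annotation, which is what allows each RSU to keep training on its own continuous data stream.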

Paper Structure (25 sections, 8 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: FedRSU system overview, where multiple roadside units (RSUs) collaboratively train a scene flow estimation model without transmitting raw data under the coordination of a cloud server. Iteratively, each RSU trains a local model in a self-supervised manner, and the server aggregates local models. FedRSU can significantly alleviate the challenges of tedious labeling and limited data for one single RSU.
  • Figure 2: The recurrent self-supervised learning paradigm. The prediction of the model can be supervised by the following frame of sensor data in a self-supervised manner. With the continuous data stream, the model can be continuously improved.
  • Figure 3: Overview of the FedRSU framework. FedRSU consists of four steps: 1) the server sends the global model to all available clients, 2) each client updates its local model, supervised by Chamfer loss and smoothness regularization, 3) each client sends its local model to the server, and 4) the server updates the global model by aggregating the received local models. These four steps are repeated for multiple rounds.
  • Figure 4: Overview of the scene flow model architecture. We follow the architecture of FlowStep3D [Flowstep3d] and predict the scene flow in a coarse-to-fine manner.
  • Figure 5: Various RSU settings in four base datasets: DAIR-V2X [Dair-v2x], LUMPI [busch2022lumpi], IPS300+ [wang2022ips300+], and our collected Campus dataset. DAIR-V2X collects data from normal traffic crossroads, LUMPI and IPS300+ collect data from busy intersections, while the Campus dataset consists of sparse traffic on campus. Data from different RSU clients are highly diverse owing to differences in sensor devices, sensor deployment, and scenarios.
  • ...and 2 more figures
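
The four-step round described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: aggregation is assumed to be a sample-count-weighted average of parameters (FedAvg-style), models are represented as plain dictionaries of NumPy arrays, and `fedavg`, `run_round`, and the `local_update` callback are hypothetical names introduced here.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Average client parameter dicts, weighted by local sample counts (assumed rule)."""
    total = float(sum(client_sizes))
    keys = client_weights[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in keys
    }


def run_round(global_weights, clients, local_update):
    """One communication round; raw client data never leaves the loop body."""
    updates, sizes = [], []
    for data in clients:
        # 1) The server sends the global model to the client.
        local = {k: v.copy() for k, v in global_weights.items()}
        # 2) The client updates its local model on its own data
        #    (in FedRSU, with a self-supervised loss; here a user-supplied callback).
        local = local_update(local, data)
        # 3) The client sends only its model parameters back.
        updates.append(local)
        sizes.append(len(data))
    # 4) The server aggregates the received local models.
    return fedavg(updates, sizes)


if __name__ == "__main__":
    # Toy example: each hypothetical client shifts the weight by its data mean.
    toy_clients = [np.array([1.0, 1.0]), np.array([3.0] * 6)]
    toy_update = lambda w, data: {"w": w["w"] + data.mean()}
    weights = {"w": np.zeros(1)}
    for _ in range(3):
        weights = run_round(weights, toy_clients, toy_update)
    print(weights)
```

Only parameter dictionaries cross the client-server boundary in this sketch, which mirrors the caption's point that RSUs collaborate without transmitting raw data. Other aggregation rules could be substituted for `fedavg` without changing the loop.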