When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

Zhen Xu, Jinsu Yoo, Cristian Bautista, Zanming Huang, Tai-Yu Pan, Zhenzhen Liu, Katie Z Luo, Mark Campbell, Bharath Hariharan, Wei-Lun Chao

Abstract

Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

Paper Structure (30 sections, 20 figures, 11 tables, 1 algorithm)

Figures (20)

  • Figure 1: Can city infrastructure teach vehicles to perceive? We explore a new paradigm where roadside infrastructure acts as distributed teachers, providing supervision to train ego perception models without manual annotations.
  • Figure 2: Overview of infrastructure-taught, label-free 3D perception. Stage 1: each RSU learns a location-specialized detector in an unsupervised manner by exploiting temporal consistency from its stationary viewpoint. Stage 2: trained RSUs broadcast their predicted 3D bounding boxes to nearby ego vehicles when their fields of view overlap. Stage 3: the ego vehicle aggregates these predictions as pseudo-labels to train its own detector offline, producing a standalone ego model that no longer requires infrastructure at deployment time.
  • Figure 3: Sample from the CIVET dataset used in Stage 2 RSU-to-ego broadcasting. The ego vehicle and RSU observe the same traffic scene from different viewpoints with overlapping fields of view.
  • Figure 4: Effectiveness of PP scores for RSU. (a) Discriminative distribution allows a clear separation between static background and objects. (b) Pseudo-labels exhibit high localization quality.
  • Figure 5: Effect of tracking refinement (Zhang et al., 2023). Incorporating tracking improves pseudo-label recall, yielding stronger supervision for unsupervised RSU training.
  • ...and 15 more figures
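
The Stage 2 to Stage 3 hand-off described in Figure 2 (RSU predictions becoming ego pseudo-labels) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes boxes are stored as `[x, y, z, l, w, h, yaw]`, that both sensor poses are known as 4x4 sensor-to-world matrices, that sensors differ only by a rotation about the vertical axis, and that a simple range cut-off (`max_range`, a hypothetical parameter) decides which boxes the ego keeps. The function names `pose_matrix` and `rsu_boxes_to_ego` are illustrative only.

```python
import numpy as np


def pose_matrix(x, y, z, yaw):
    """Build a 4x4 sensor-to-world transform (rotation about z only, an assumption)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = [x, y, z]
    return T


def rsu_boxes_to_ego(boxes, rsu_to_world, ego_to_world, max_range=50.0):
    """Re-express RSU-frame boxes in the ego frame and keep those within range.

    boxes: (N, 7) array-like of [x, y, z, l, w, h, yaw] in the RSU frame.
    Returns an (M, 7) array in the ego frame, usable as pseudo-labels.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 7)
    # Chain the transforms: RSU frame -> world frame -> ego frame.
    rsu_to_ego = np.linalg.inv(ego_to_world) @ rsu_to_world

    # Transform box centers using homogeneous coordinates.
    centers = np.hstack([boxes[:, :3], np.ones((len(boxes), 1))])
    out = boxes.copy()
    out[:, :3] = (centers @ rsu_to_ego.T)[:, :3]

    # Under the yaw-only assumption, headings shift by the relative yaw.
    yaw_offset = np.arctan2(rsu_to_ego[1, 0], rsu_to_ego[0, 0])
    out[:, 6] = (boxes[:, 6] + yaw_offset + np.pi) % (2 * np.pi) - np.pi

    # Keep only boxes within the assumed ego perception range (planar distance).
    keep = np.linalg.norm(out[:, :2], axis=1) <= max_range
    return out[keep]


if __name__ == "__main__":
    rsu = pose_matrix(10.0, 0.0, 0.0, np.pi / 2)  # hypothetical RSU pose
    ego = pose_matrix(0.0, 0.0, 0.0, 0.0)         # hypothetical ego pose
    rsu_boxes = [[5.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0]]
    print(rsu_boxes_to_ego(rsu_boxes, rsu, ego))
```

A real system would also need to handle time synchronization, full 6-DoF poses, and merging of overlapping predictions from several RSUs, none of which this sketch covers.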