TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes

Yan Xia, Yunxiang Lu, Rui Song, Oussema Dhaouadi, João F. Henriques, Daniel Cremers

TL;DR

TrafficLoc localizes traffic surveillance cameras within a 3D reference map by learning a coarse-to-fine image-to-point-cloud registration with geometry-guided cross-modal attention. It introduces Geometry-guided Attention Loss (GAL), Inter-intra Contrastive Learning (ICL), and Dense Training Alignment (DTA) to strengthen 2D-3D correspondence under large viewpoint changes, enabling robust 6-DoF pose estimation via EPnP-RANSAC. The approach is validated on the newly proposed Carla Intersection dataset (75 intersections across 8 worlds) and generalizes to KITTI and NuScenes, achieving state-of-the-art localization accuracy and improved cross-domain performance, including on challenging unseen scenes. The work provides a practical, scalable framework for cooperative perception in city-scale camera networks, and the Carla Intersection dataset and supplementary materials are provided to facilitate further research.

Abstract

We tackle the problem of localizing traffic cameras within a 3D reference map and propose a novel image-to-point cloud registration (I2P) method, TrafficLoc, in a coarse-to-fine matching fashion. To overcome the lack of large-scale real-world intersection datasets, we first introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. We find that current I2P methods struggle with cross-modal matching under large viewpoint differences, especially at traffic intersections. TrafficLoc thus employs a novel Geometry-guided Attention Loss (GAL) to focus only on the corresponding geometric regions under different viewpoints during 2D-3D feature fusion. To address feature inconsistency in paired image patch-point groups, we further propose Inter-intra Contrastive Learning (ICL) to enhance separating 2D patch/3D group features within each intra-modality and introduce Dense Training Alignment (DTA) with soft-argmax for improving position regression. Extensive experiments show our TrafficLoc greatly improves the performance over the SOTA I2P methods (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on KITTI and NuScenes datasets, demonstrating the superiority across both in-vehicle and traffic cameras. Our project page is publicly available at https://tum-luk.github.io/projects/trafficloc/.
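
The abstract mentions soft-argmax for position regression without spelling out the operation. As a rough illustration only, the following NumPy sketch shows a generic 2D soft-argmax: a softmax over a score map followed by the expected pixel coordinate. The function name `soft_argmax_2d`, the `temperature` parameter, and the score-map layout are assumptions made for this example and are not taken from the TrafficLoc implementation.

```python
import numpy as np


def soft_argmax_2d(score_map, temperature=1.0):
    """Return the expected (x, y) position under a softmax over score_map.

    score_map: array of shape (H, W) with matching scores (hypothetical input).
    temperature: assumed softmax temperature; smaller values sharpen the peak.
    """
    h, w = score_map.shape
    logits = np.asarray(score_map, dtype=float).reshape(-1) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()

    # Pixel coordinate grids: ys holds row indices, xs holds column indices.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = float((probs * xs.reshape(-1)).sum())
    y = float((probs * ys.reshape(-1)).sum())
    return x, y
```

Unlike a hard argmax, this expectation is differentiable with respect to the scores and can return sub-pixel positions, which is the usual motivation for using it in position regression.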

Paper Structure

This paper contains 22 sections, 15 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Localization accuracy on the proposed Carla Intersection and KITTI datasets. The first column presents the input point cloud and input image. The point cloud is projected into a 2D view and shown above the image, with point colors indicating distance. The proposed TrafficLoc achieves better performance, with more correct (green) and fewer incorrect (red) point-to-pixel pairs.
  • Figure 2: Pipeline of our TrafficLoc. Given a traffic camera image and a 3D scene point cloud collected at different locations, we first extract features at the point group level and image patch level, respectively. We then fuse them using a Geometry-guided Feature Fusion (GFF) module and match them based on similarity rules. Furthermore, we perform fine matching between the point group center and the extracted image window with a soft-argmax operation. Finally, we use the EPnP-RANSAC [epnp, pnpsolver] algorithm to get the final camera pose based on the predicted 2D-3D correspondences.
  • Figure 3: The pipeline of the Geometry-guided Feature Fusion (GFF) module. GFF first uses $N_c$ layers of self- and cross-attention modules to enhance the features across different modalities (left). The proposed Geometry-guided Attention Loss is applied to the cross-attention map of the last fusion layer based on camera projection geometry (right).
  • Figure 4: Coarse matching mechanism of TrafficLoc. The positive feature pairs are generated based on the ground-truth transformation matrix. The coarse image feature ${F}^{coarse}_I$ is reshaped to compute its similarity map with each coarse point feature.
  • Figure 5: Localization performance of our TrafficLoc on the USTC intersection dataset [sheng2024rendering]. Note that the model is trained on the synthetic Carla Intersection dataset.
  • ...and 7 more figures
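
The Figure 4 caption describes generating positive pairs from the ground-truth transformation and computing a similarity map between the coarse image feature and each coarse point feature. A minimal NumPy sketch of these two steps is given below, assuming a pinhole camera with intrinsics `K`, a world-to-camera pose `(R, t)`, square image patches of side `patch`, and cosine similarity as the matching score. All function names, parameters, and the choice of cosine similarity are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def positive_patch_index(points_world, R, t, K, image_hw, patch):
    """Map each 3D point to the index of the image patch it projects into.

    Assumes the pinhole model x = K (R X + t). Returns -1 for points that
    lie behind the camera or project outside the image.
    """
    h, w = image_hw
    cam = points_world @ R.T + t              # world -> camera coordinates
    depth = cam[:, 2]
    safe_depth = np.where(depth > 1e-6, depth, 1.0)  # avoid division by zero
    uv = (cam @ K.T)[:, :2] / safe_depth[:, None]    # perspective division

    visible = (
        (depth > 1e-6)
        & (uv[:, 0] >= 0) & (uv[:, 0] < w)
        & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    )
    cols = np.floor(uv[:, 0] / patch).astype(int)
    rows = np.floor(uv[:, 1] / patch).astype(int)
    idx = rows * (w // patch) + cols          # row-major patch index
    return np.where(visible, idx, -1)


def similarity_map(image_feat, point_feat):
    """Cosine similarity between one point feature and every image location.

    image_feat: (H, W, C) coarse image features; point_feat: (C,) feature.
    Returns an (H, W) map whose maximum marks the best-matching location.
    """
    f = image_feat / np.linalg.norm(image_feat, axis=-1, keepdims=True)
    p = point_feat / np.linalg.norm(point_feat)
    return f @ p
```

In this sketch, a point group center whose returned index is non-negative would form a positive pair with that patch, and the peak of `similarity_map` would give the predicted coarse match for comparison.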