Exploring Surround-View Fisheye Camera 3D Object Detection

Changcai Li, Wenwei Lin, Zuoxun Hou, Gang Chen, Wei Zhang, Huihui Zhou, Weishi Zheng

TL;DR

This work investigates the technical feasibility of implementing end-to-end 3D object detection (3DOD) with a surround-view fisheye camera system, and develops two methods that incorporate the unique geometry of fisheye images into mainstream detection frameworks.

Abstract

In this work, we explore the technical feasibility of implementing end-to-end 3D object detection (3DOD) with a surround-view fisheye camera system. Specifically, we first investigate the performance drop incurred when transferring classic pinhole-based 3D object detectors to fisheye imagery. To mitigate this drop, we then develop two methods that incorporate the unique geometry of fisheye images into mainstream detection frameworks: one based on the bird's-eye-view (BEV) paradigm, named FisheyeBEVDet, and the other on the query-based paradigm, named FisheyePETR. Both methods adopt spherical spatial representations to capture fisheye geometry. Because no dedicated evaluation benchmark exists, we release Fisheye3DOD, a new open dataset synthesized using CARLA and featuring both standard pinhole and fisheye camera arrays. Experiments on Fisheye3DOD show that our fisheye-compatible modeling improves accuracy by up to 6.2% over baseline methods.

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Left: The pinhole camera setup has blind spots in the near field, whereas the fisheye camera provides enhanced coverage for improved safety. Right: The same object is captured from multiple fisheye viewpoints.
  • Figure 2: Camera layouts for surround-view perception.
  • Figure 3: Left: The horizontal axis indicates the 3D distance from the ego vehicle. The vertical axis indicates the ratio between the largest projected 2D bounding box size of an object in any fisheye camera and that in any pinhole camera. The points in the figure correspond to 100 samples per category, with the curves fitted using LOWESS (Cleveland, 1979). Right: Illustration of pixel compression in fisheye and pinhole images. The same object occupies approximately $70 \times 80$ pixels in the pinhole image, but only about $22 \times 26$ pixels in the fisheye image. The pixel area in the fisheye image is therefore roughly 0.1 times that of the pinhole image ($572 / 5600 \approx 0.10$).
  • Figure 4: The architecture of the proposed methods. (a): Multi-view fisheye images are processed by a shared backbone, and their features are projected into an equirectangular representation via the projection function $\Pi$. (b): In FisheyeBEVDet, the projected 2D features are lifted onto a 3D spherical grid to construct a BEV representation. (c): In FisheyePETR, the 2D features are encoded with spherical coordinates and interact with object queries through multi-head cross-attention (MHCA).
  • Figure 5: We visualize the predictions in LiDAR point clouds for a clearer comparison. The first row shows a sparse traffic scenario with isolated vehicles on open roads, whereas the second row depicts a dense urban junction with multi-object occlusion.
  • ...and 1 more figure
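
The captions for Figure 4 refer to an equirectangular representation and to spherical coordinates but do not define the projection function $\Pi$ here. As a rough illustration only, the sketch below maps a fisheye pixel to spherical angles and then to an equirectangular grid position. It assumes an ideal equidistant fisheye model ($r = f\theta$), which may differ from the camera model used in the paper; the function names, parameters, and axis conventions are all hypothetical.

```python
import math


def fisheye_pixel_to_sphere(u, v, cx, cy, f):
    """Map a fisheye pixel to (azimuth, elevation) in radians.

    Assumption (not from the paper): ideal equidistant model r = f * theta,
    with principal point (cx, cy), focal length f in pixels, camera z axis
    along the optical axis, x to the right, and image y pointing down.
    """
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)        # radial distance from the principal point
    theta = r / f                 # angle between the ray and the optical axis
    phi = math.atan2(dy, dx)      # angle around the optical axis

    # Unit ray direction in the camera frame.
    x = math.sin(theta) * math.cos(phi)
    y = math.sin(theta) * math.sin(phi)
    z = math.cos(theta)

    azimuth = math.atan2(x, z)                       # left/right angle
    elevation = math.asin(max(-1.0, min(1.0, -y)))   # up is positive
    return azimuth, elevation


def sphere_to_equirect(azimuth, elevation, width, height):
    """Map spherical angles to (column, row) on a width x height
    equirectangular grid covering 360 degrees by 180 degrees."""
    col = (azimuth / (2.0 * math.pi) + 0.5) * width
    row = (0.5 - elevation / math.pi) * height
    return col, row


if __name__ == "__main__":
    # Hypothetical intrinsics, chosen only for the demonstration.
    az, el = fisheye_pixel_to_sphere(u=900.0, v=400.0, cx=640.0, cy=480.0, f=320.0)
    print(sphere_to_equirect(az, el, width=1024, height=512))
```

Under these assumptions the principal point lands at the centre of the equirectangular grid, and a pixel at radius $f\pi/2$ to the right of it corresponds to a ray 90 degrees off the optical axis. A real pipeline would apply such a mapping to feature maps through a precomputed sampling grid and would use calibrated distortion parameters.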