Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles

Weimin Liu, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

Abstract

Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotic platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose ArticuSurDepth, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency, guided by structural priors from a vision foundation model. Specifically, we introduce a multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate ground-plane-aware camera height regularization to encourage metric depth estimation, together with a cross-vehicle pose consistency term that bridges motion estimation between the articulated segments. To validate the proposed method, we built an articulated-vehicle experimental platform and collected a dataset with it. Experimental results demonstrate state-of-the-art (SoTA) depth estimation performance on our self-collected dataset as well as on the DDAD, nuScenes, and KITTI benchmarks.
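To make the self-supervised setup concrete, the following is a minimal, illustrative sketch (not the authors' released code) of the reprojection objective that surround-depth frameworks such as ArticuSurDepth build on: the depth predicted for a target camera is used to inverse-warp a source view (a temporal neighbour, an adjacent camera on the same vehicle, or a cross-vehicle camera on the other segment) into the target frame, and the photometric error supervises the depth network. All function and tensor names, and the plain L1 photometric term, are assumptions made for clarity.

    import torch
    import torch.nn.functional as F

    def inverse_warp(src_img, tgt_depth, T_tgt_to_src, K_tgt, K_src):
        """Warp src_img into the target camera using the target depth map.

        src_img:      (B, 3, H, W) source view (temporal or spatial/cross-vehicle context)
        tgt_depth:    (B, 1, H, W) predicted depth in the target camera
        T_tgt_to_src: (B, 4, 4) relative pose from the target to the source camera
        K_tgt, K_src: (B, 3, 3) intrinsics of the target and source cameras
        """
        B, _, H, W = tgt_depth.shape
        device = tgt_depth.device

        # Pixel grid of the target image in homogeneous coordinates.
        ys, xs = torch.meshgrid(
            torch.arange(H, device=device, dtype=torch.float32),
            torch.arange(W, device=device, dtype=torch.float32),
            indexing="ij",
        )
        ones = torch.ones_like(xs)
        pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, -1, -1)  # (B, 3, HW)

        # Back-project to 3D in the target camera, then transform into the source camera.
        cam_pts = torch.linalg.inv(K_tgt) @ pix * tgt_depth.reshape(B, 1, -1)          # (B, 3, HW)
        cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W, device=device)], dim=1)  # (B, 4, HW)
        src_pts = (T_tgt_to_src @ cam_pts_h)[:, :3]                                      # (B, 3, HW)

        # Project into the source image and normalise to [-1, 1] for grid_sample.
        src_pix = K_src @ src_pts
        src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)
        u = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
        v = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
        grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)

        return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)

    def photometric_loss(tgt_img, warped_src):
        # Plain L1 error as a stand-in; the paper's full objective also involves the
        # geometric-consistency terms summarized in the abstract (surface normal
        # consistency, ground-plane-aware height regularization, cross-vehicle pose
        # consistency), which are not reproduced here.
        return (tgt_img - warped_src).abs().mean()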

Paper Structure

This paper contains 27 sections, 28 equations, 10 figures, 6 tables.

Figures (10)

  • Figure A1: Surround depth estimation for an articulated vehicle.
  • Figure C1: Overview: (a) Network architecture of ArticuSurDepth; (b) Self-supervised training framework and its loss components: (Left) Within- and cross-vehicle spatial context enrichment. Example: for the target view $C_5$, the within-vehicle right view is $C_6$, while the type-2 cross-vehicle right view is $C_0$. (Right) Cross-view pseudo surface normal consistency ($\mathcal{L}_\text{PNC}$).
  • Figure D1: Example of cross-vehicle extrinsics calibration: (a) LiDAR pointclouds registration; (b) Pointclouds of $\mathcal{L}_f$ projected on camera $C_6$ (mounted on front vehicle); (c) Pointclouds of $\mathcal{L}_r$ projected on camera $C_6$; (d) Within- and cross-vehicle spatial contexts and transformations. (A sketch of how these cross-vehicle transforms compose is given after this figure list.)
  • Figure D2: Example of spatial warps: (a) Color image of $C_5$; (b) Spatial warp from left camera $C_9$; (c) Spatial cross-vehicle warp (type-2) from left camera $C_2$; (d) Spatial cross-vehicle warp (type-2) from right camera $C_3$.
  • Figure D3: Comparison of depth-based and direct interpolation-based surface normal reprojection methods. PSN denotes pseudo surface normal; ST denotes spatial-temporal context.
  • ...and 5 more figures
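As referenced in the Figure D1 caption above, the sketch below illustrates how a cross-vehicle camera-to-camera transform (used for the "type-2" spatial warps of Figure D2) could be composed from the per-segment camera extrinsics and the inter-vehicle transform obtained by LiDAR point-cloud registration. The frame-naming convention (T_a_b maps points expressed in frame b into frame a), the function name, and the argument names are illustrative assumptions, not the paper's API.

    import numpy as np

    def compose_cross_vehicle_extrinsics(T_frontcam_frontbody: np.ndarray,
                                         T_frontbody_rearbody: np.ndarray,
                                         T_rearbody_rearcam: np.ndarray) -> np.ndarray:
        """Return the 4x4 rigid transform mapping points from a rear-vehicle camera
        frame into a front-vehicle camera frame, by chaining the per-segment camera
        extrinsics with the inter-vehicle transform (e.g. from LiDAR registration,
        as in Figure D1a)."""
        return T_frontcam_frontbody @ T_frontbody_rearbody @ T_rearbody_rearcam

Note that, because the articulation angle changes as the vehicle turns, the inter-vehicle transform is generally time-varying and would have to be re-estimated per frame before cross-vehicle warps can be formed; this is the motion coupling that the cross-vehicle pose consistency described in the abstract is meant to address.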