BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation
Dingning Liu, Haoyu Guo, Jingyi Zhou, Tong He
TL;DR
BRIDGE tackles data scarcity and label quality in monocular depth estimation by coupling an RL-optimized depth-to-image (D2I) engine with a hybrid supervision training regime. The engine synthesizes about 20M high-fidelity RGB-D samples from diverse depth maps, and training mixes two supervision sources: pseudo-labels from a strong teacher model and ground-truth depth in the regions where it is precise. Together these enable scalable, diverse, and geometry-consistent training. The MDE model uses a DINOv2-Giant encoder and a metric-depth scale head, is trained with affine-invariant and gradient-based losses, and achieves strong zero-shot performance and competitive metric depth after targeted fine-tuning. The approach improves data efficiency and generalization, supporting robust depth perception in complex real-world and in-the-wild scenes while reducing reliance on large real-world labeled datasets.
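To make the "affine-invariant and gradient-based losses" concrete, the following is a minimal NumPy sketch of one common formulation: the prediction is aligned to the target with a least-squares scale and shift, and the loss is then measured on the aligned residual and on its spatial gradients. The exact loss definitions used by BRIDGE are not given in this summary, so the function names, the least-squares alignment, and the L1 penalties here are illustrative assumptions.

```python
import numpy as np

def align_scale_shift(pred, target, mask):
    """Least-squares scale s and shift t such that s * pred + t ~ target on mask.

    Assumption: least-squares alignment is one common choice; the paper may
    use a different normalization.
    """
    p = pred[mask].astype(np.float64)
    g = target[mask].astype(np.float64)
    A = np.stack([p, np.ones_like(p)], axis=1)  # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s, t

def affine_invariant_loss(pred, target, mask):
    """Mean absolute error after removing the unknown scale and shift."""
    s, t = align_scale_shift(pred, target, mask)
    return float(np.abs(s * pred + t - target)[mask].mean())

def gradient_loss(pred, target, mask):
    """L1 penalty on horizontal and vertical differences of the aligned residual."""
    s, t = align_scale_shift(pred, target, mask)
    residual = s * pred + t - target
    dx = np.abs(residual[:, 1:] - residual[:, :-1])
    dy = np.abs(residual[1:, :] - residual[:-1, :])
    # A difference is valid only if both neighbouring pixels are valid.
    mx = mask[:, 1:] & mask[:, :-1]
    my = mask[1:, :] & mask[:-1, :]
    n = int(mx.sum() + my.sum())
    return float((dx[mx].sum() + dy[my].sum()) / max(n, 1))
```

Under this formulation, a prediction that differs from the target only by a global scale and offset incurs (numerically) zero loss, which is why such losses suit relative-depth supervision.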
Abstract
Monocular Depth Estimation (MDE) is a foundational task in computer vision. Traditional methods are limited by data scarcity and quality, which hinders their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images from diverse source depth maps, each intrinsically paired with its ground-truth depth. We then train our depth estimation model on this dataset using a hybrid supervision strategy that integrates teacher pseudo-labels with ground-truth depth for comprehensive and robust training. This data generation and training paradigm gives BRIDGE greater scale and domain diversity: it consistently outperforms existing state-of-the-art approaches, both quantitatively and in capturing detail in complex scenes, and yields general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.
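As a rough illustration of the hybrid supervision idea, the sketch below combines two masked loss terms: one against ground-truth depth on pixels flagged as reliable, and one against teacher pseudo-labels on the remaining pixels. The abstract does not specify how the two sources are combined, so the mask-based split, the plain L1 penalty, and the names `hybrid_loss`, `gt_mask`, `w_gt`, and `w_teacher` are all hypothetical.

```python
import numpy as np

def masked_l1(pred, target, mask):
    """Mean absolute error over the pixels selected by mask (0.0 if none)."""
    if not mask.any():
        return 0.0
    return float(np.abs(pred - target)[mask].mean())

def hybrid_loss(pred, gt_depth, teacher_depth, gt_mask, w_gt=1.0, w_teacher=1.0):
    """Weighted sum of a ground-truth term and a teacher pseudo-label term.

    Assumption: ground truth supervises pixels where gt_mask is True and the
    teacher supervises the rest; the actual split used by BRIDGE may differ.
    """
    loss_gt = masked_l1(pred, gt_depth, gt_mask)
    loss_teacher = masked_l1(pred, teacher_depth, ~gt_mask)
    return w_gt * loss_gt + w_teacher * loss_teacher
```

In practice the plain L1 term would likely be replaced by a scale-aware loss, since teacher pseudo-labels and ground-truth depth need not share the same scale.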
