BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation

Dingning Liu, Haoyu Guo, Jingyi Zhou, Tong He

TL;DR

BRIDGE tackles data scarcity and label quality in monocular depth estimation (MDE) by coupling an RL-optimized depth-to-image (D2I) engine with a hybrid supervision training regime. The engine synthesizes roughly 20M high-fidelity RGB-D samples from diverse source depth maps. Training mixes two sources of supervision, pseudo-labels from a strong teacher model and region-precise ground-truth depth, which keeps training scalable, diverse, and geometry-consistent. The MDE model uses a DINOv2-Giant encoder and a metric-depth scale head and is trained with affine-invariant and gradient-based losses. It achieves strong zero-shot performance, and competitive metric depth after targeted fine-tuning. The approach improves data efficiency and generalization, giving robust depth perception in complex real-world and in-the-wild scenes while reducing reliance on massive real-world labeled data.
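The summary above names affine-invariant and gradient-based losses but does not give their formulas. As a rough illustration only, the sketch below uses a formulation that is common in the MDE literature: each depth map is shifted by its median and scaled by its mean absolute deviation before an L1 comparison, and the gradient term penalizes differences between neighboring-pixel residuals. The function names, the NumPy implementation, and the exact normalization are assumptions, not the authors' code.

```python
import numpy as np


def normalize_affine(depth, mask):
    """Shift by the median and scale by the mean absolute deviation (valid pixels only)."""
    valid = depth[mask]
    shift = np.median(valid)
    scale = np.mean(np.abs(valid - shift))
    return (depth - shift) / max(scale, 1e-6)


def affine_invariant_loss(pred, target, mask):
    """Mean L1 distance between the normalized prediction and normalized target."""
    p = normalize_affine(pred, mask)
    g = normalize_affine(target, mask)
    return float(np.mean(np.abs(p - g)[mask]))


def gradient_loss(pred, target, mask):
    """Mean absolute horizontal and vertical gradient of the residual map."""
    diff = np.where(mask, pred - target, 0.0)
    # Only count a gradient where both neighboring pixels are valid.
    gx = np.abs(diff[:, 1:] - diff[:, :-1]) * (mask[:, 1:] & mask[:, :-1])
    gy = np.abs(diff[1:, :] - diff[:-1, :]) * (mask[1:, :] & mask[:-1, :])
    return float((gx.sum() + gy.sum()) / max(int(mask.sum()), 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.uniform(1.0, 10.0, size=(8, 8))
    mask = np.ones_like(target, dtype=bool)
    # A positive scale and shift of the target should cost (almost) nothing.
    print(affine_invariant_loss(3.0 * target + 2.0, target, mask))
    # A constant offset leaves the residual gradients at zero.
    print(gradient_loss(target + 5.0, target, mask))
```

Under this normalization, a prediction that differs from the target only by a positive scale and a shift incurs near-zero loss, which is the property that lets relative-depth training pool datasets with different depth units.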

Abstract

Monocular Depth Estimation (MDE) is a foundational task for computer vision. Traditional methods are limited by data scarcity and quality, hindering their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images, each intrinsically paired with its ground truth depth, from diverse source depth maps. Then we train our depth estimation model on this dataset, employing a hybrid supervision strategy that integrates teacher pseudo-labels with ground truth depth for comprehensive and robust training. This innovative data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches quantitatively and in complex scene detail capture, thereby fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.
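The abstract does not spell out how teacher pseudo-labels and ground-truth depth are combined. One plausible reading, sketched below purely as an assumption, is a per-pixel split: ground truth supervises the regions where it is trusted, and the teacher's pseudo-label supervises the rest. The mask `gt_mask`, the weights `w_gt` and `w_teacher`, and the plain L1 error are all hypothetical choices for illustration.

```python
import numpy as np


def hybrid_supervision_loss(pred, teacher, gt, gt_mask, w_gt=1.0, w_teacher=1.0):
    """Weighted sum of an L1 loss against ground truth (inside gt_mask)
    and an L1 loss against the teacher pseudo-label (outside gt_mask)."""
    teacher_mask = ~gt_mask
    # Guard against empty regions so the mean is always defined.
    loss_gt = np.abs(pred - gt)[gt_mask].mean() if gt_mask.any() else 0.0
    loss_teacher = (
        np.abs(pred - teacher)[teacher_mask].mean() if teacher_mask.any() else 0.0
    )
    return float(w_gt * loss_gt + w_teacher * loss_teacher)


if __name__ == "__main__":
    gt = np.full((4, 4), 2.0)
    teacher = gt + 1.0           # hypothetical teacher that is off by one unit
    gt_mask = np.zeros((4, 4), dtype=bool)
    gt_mask[:, :2] = True        # trust ground truth on the left half only
    pred = gt.copy()             # prediction matches ground truth everywhere
    print(hybrid_supervision_loss(pred, teacher, gt, gt_mask))
```

In this toy case the prediction agrees with the ground truth, so only the teacher-supervised half contributes to the loss. In practice the per-pixel error would more likely be the affine-invariant and gradient-based terms mentioned in the TL;DR rather than a raw L1 difference.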

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We present BRIDGE. Its RL-optimized Depth-to-Image (D2I) data generation engine produces realistic and geometrically accurate RGB images from source depth maps, and its Monocular Depth Estimation (MDE) model, trained on the massive high-quality data generated by the D2I engine, achieves superior depth prediction in complex scenes.
  • Figure 2: Reward Model Process. Our D2I model is trained via reward-gradient-driven direct optimization, avoiding complex proxy objective functions and improving training memory efficiency.
  • Figure 3: Comparison between our model and leading methods, including Depth Anything V2 [yang2024depth2], Depth Pro [bochkovskii2024depth], and Marigold [ke2025marigold], on open-world images.
  • Figure 4: Additional comparison between our model and leading methods, including Depth Anything V2 [yang2024depth2], Depth Pro [bochkovskii2024depth], and Marigold [ke2025marigold], on "in-the-wild" images.
  • Figure 5: Our model, BRIDGE, generates superior, high-fidelity depth maps that enable ControlNet [zhang2023adding] to synthesize new images with a zero-shot capability, precisely replicating the depth field of the source image. In contrast, Depth Anything V2 [yang2024depth2] struggles to produce an accurate depth field, as demonstrated by the clear discrepancies between its corresponding ControlNet output and the source images. The prompts used for ControlNet are displayed in the lower left corners, and all images were generated with the same random seed.