CoL3D: Collaborative Learning of Single-view Depth and Camera Intrinsics for Metric 3D Shape Recovery

Chenghao Zhang, Lubin Fan, Shen Cao, Bojian Wu, Jieping Ye

TL;DR

CoL3D addresses metric 3D shape recovery from a single image by jointly estimating depth and camera intrinsics within a unified network. It introduces a canonical incidence field as a prior and a shape similarity loss to align 3D geometry, enabling differentiable optimization from depth to the 3D point cloud. When trained and tested in-domain on a single dataset, CoL3D achieves strong depth accuracy, robust camera calibration, and high-fidelity 3D reconstructions across indoor and outdoor benchmarks, including NYU, KITTI, SUN RGB-D, and related datasets. The work demonstrates a reciprocal relationship between depth and intrinsics and offers a practical pathway for single-view metric 3D perception in robotics.

Abstract

Recovering the metric 3D shape from a single image is particularly relevant for robotics and embodied intelligence applications, where accurate spatial understanding is crucial for navigation and interaction with environments. Mainstream approaches usually achieve this through monocular depth estimation. However, without camera intrinsics, the metric 3D shape cannot be recovered from depth alone. In this study, we theoretically demonstrate that depth serves as a 3D prior constraint for estimating camera intrinsics and uncover the reciprocal relations between these two elements. Motivated by this, we propose a collaborative learning framework for jointly estimating depth and camera intrinsics, named CoL3D, to learn metric 3D shapes from single images. Specifically, CoL3D adopts a unified network and performs collaborative optimization at three levels: depth, camera intrinsics, and 3D point clouds. For camera intrinsics, we design a canonical incidence field mechanism as a prior that enables the model to learn the residual incidence field for enhanced calibration. Additionally, we incorporate a shape similarity measurement loss in the point cloud space, which improves the quality of 3D shapes essential for robotic applications. As a result, when training and testing on a single dataset with in-domain settings, CoL3D delivers outstanding performance in both depth estimation and camera calibration across several indoor and outdoor benchmark datasets, which leads to remarkable 3D shape quality for the perception capabilities of robots.
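
The claim that depth alone does not fix the metric 3D shape can be illustrated with a small sketch. It assumes a standard pinhole camera model; the function names, the toy intrinsics, and the use of a Chamfer distance as the point-cloud similarity measure are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map to an (H*W, 3) point cloud (pinhole model)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def chamfer_distance(p, q):
    """Symmetric mean squared nearest-neighbour distance between two clouds.

    Assumption: a generic shape-similarity measure, used here only for illustration.
    """
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Toy scene: a flat wall 2 m in front of a 16x12 pixel camera.
depth = np.full((12, 16), 2.0)

# Same depth map, two different (hypothetical) focal lengths.
cloud_true = backproject_depth(depth, fx=20.0, fy=20.0, cx=8.0, cy=6.0)
cloud_wrong = backproject_depth(depth, fx=40.0, fy=40.0, cx=8.0, cy=6.0)

# Metric width of the wall implied by each set of intrinsics.
width_true = cloud_true[:, 0].max() - cloud_true[:, 0].min()
width_wrong = cloud_wrong[:, 0].max() - cloud_wrong[:, 0].min()
shape_gap = chamfer_distance(cloud_true, cloud_wrong)

print(width_true, width_wrong, shape_gap)
```

Doubling the assumed focal length halves the recovered width of the wall even though the depth map is identical, which is why an error in the intrinsics shows up directly as an error in the 3D shape.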

Paper Structure

This paper contains 16 sections, 2 theorems, 18 equations, 5 figures, 8 tables.

Key Result

Proposition 1

Given the depth map of an image, the 4 DoF camera intrinsics can be determined by 4 non-overlapping groups of pixels in the image with their Euclidean distances in the 3D space.
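
One way to see why known depths and 3D distances constrain the intrinsics is the numerical sketch below. It is not the paper's proof: instead of the minimal 4 groups in the proposition, it assumes a pinhole model, uses many pixel pairs, and solves an overdetermined linearization with five auxiliary unknowns. All names and the synthetic intrinsics are hypothetical. For a pixel pair with depths $d_i, d_j$ and 3D distance $L$, the squared distance expands to an equation that is linear in $(a,\ a c_x,\ b,\ b c_y,\ a c_x^2 + b c_y^2)$ with $a = 1/f_x^2$ and $b = 1/f_y^2$.

```python
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Back-project (N, 2) pixels with (N,) depths to (N, 3) camera-frame points."""
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=1)

def intrinsics_from_distances(uv_i, uv_j, d_i, d_j, dist, scale):
    """Estimate (fx, fy, cx, cy) from pixel pairs, their depths, and 3D distances.

    `scale` (e.g. the image width) normalizes pixel coordinates so the
    linear system is well conditioned; results are rescaled to pixels.
    """
    ui, vi = uv_i[:, 0] / scale, uv_i[:, 1] / scale
    uj, vj = uv_j[:, 0] / scale, uv_j[:, 1] / scale
    A = d_i * ui - d_j * uj      # depth-weighted horizontal difference
    B = d_i * vi - d_j * vj      # depth-weighted vertical difference
    D = d_i - d_j                # depth difference
    # a*(A - cx*D)^2 + b*(B - cy*D)^2 = dist^2 - D^2, expanded term by term.
    M = np.stack([A**2, -2 * A * D, B**2, -2 * B * D, D**2], axis=1)
    rhs = dist**2 - D**2
    a, a_cx, b, b_cy, _ = np.linalg.lstsq(M, rhs, rcond=None)[0]
    fx, fy = scale / np.sqrt(a), scale / np.sqrt(b)
    cx, cy = scale * a_cx / a, scale * b_cy / b
    return np.array([fx, fy, cx, cy])

# Synthetic check with arbitrary ground-truth intrinsics for a 640x480 image.
rng = np.random.default_rng(0)
true_K = np.array([520.0, 515.0, 325.0, 250.0])  # fx, fy, cx, cy (made up)
n = 64
uv_i = rng.uniform([0, 0], [640, 480], size=(n, 2))
uv_j = rng.uniform([0, 0], [640, 480], size=(n, 2))
d_i = rng.uniform(1.0, 5.0, size=n)
d_j = rng.uniform(1.0, 5.0, size=n)
dist = np.linalg.norm(
    backproject(uv_i, d_i, *true_K) - backproject(uv_j, d_j, *true_K), axis=1
)

est_K = intrinsics_from_distances(uv_i, uv_j, d_i, d_j, dist, scale=640.0)
print(est_K)
```

With noise-free synthetic data the least-squares solution should reproduce the intrinsics used to generate the distances; with real, noisy depth a robust or nonlinear refinement would likely be needed.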

Figures (5)

  • Figure 1: Comparison of our collaborative learning framework with single-task monocular depth estimation and camera calibration.
  • Figure 2: Overview of the proposed CoL3D framework. It consists of an Encoder and Decoder for latent feature extraction, a Depth Head for depth prediction, and a Camera Head for camera intrinsics estimation. Collaborative learning is performed on the depth map, the incidence field, and the 3D point cloud. Note that ground-truth camera intrinsics are used only for training; at inference, the intrinsics are predicted by the model itself.
  • Figure 3: Qualitative 3D shape comparison on the NYU dataset. The red boxes indicate the regions to focus on.
  • Figure 4: Qualitative 3D shape comparison on the KITTI dataset. The red boxes show the regions to focus on.
  • Figure 5: Effect of canonical focal length on the NYU dataset.
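
The incidence field mentioned in the Figure 2 caption, and the canonical focal length studied in Figure 5, can be made concrete with a short sketch. It assumes the incidence field is the per-pixel ray direction $K^{-1}[u, v, 1]^\top$ of a pinhole camera, that the canonical field uses a fixed focal length with the principal point at the image center, and that the network output is added to it as a residual. These specifics, the function names, and the plain least-squares line fit used to read intrinsics back out are assumptions for illustration, not the paper's stated procedure.

```python
import numpy as np

def incidence_field(h, w, fx, fy, cx, cy):
    """Per-pixel ray directions ((u - cx)/fx, (v - cy)/fy, 1) as an (H, W, 3) array."""
    v, u = np.meshgrid(
        np.arange(h, dtype=np.float64), np.arange(w, dtype=np.float64), indexing="ij"
    )
    return np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)

def intrinsics_from_field(field):
    """Recover (fx, fy, cx, cy) by fitting a line to each field component.

    The x-component is u/fx - cx/fx, so slope = 1/fx and intercept = -cx/fx;
    the y-component is handled the same way along the vertical axis.
    """
    h, w, _ = field.shape
    u = np.arange(w, dtype=np.float64)
    v = np.arange(h, dtype=np.float64)
    slope_x, icpt_x = np.polyfit(u, field[..., 0].mean(axis=0), 1)
    slope_y, icpt_y = np.polyfit(v, field[..., 1].mean(axis=1), 1)
    return np.array([1 / slope_x, 1 / slope_y, -icpt_x / slope_x, -icpt_y / slope_y])

h, w = 480, 640
true_K = np.array([520.0, 515.0, 325.0, 250.0])   # fx, fy, cx, cy (made up)
f_canon = 500.0                                   # hypothetical canonical focal length

target = incidence_field(h, w, *true_K)
canonical = incidence_field(h, w, f_canon, f_canon, w / 2, h / 2)

# What a camera head would need to predict under the residual assumption.
residual = target - canonical

# Adding the residual back to the canonical prior gives the full field.
recovered_K = intrinsics_from_field(canonical + residual)
print(recovered_K)
```

Under these assumptions the residual has a zero third component and small magnitude whenever the canonical focal length is close to the true one, which is one plausible reason the choice of canonical focal length matters.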

Theorems & Definitions (3)

  • Proposition
  • Proposition
  • Proof