Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning

Mengtan Zhang, Zizhan Guo, Hongbo Zhao, Yi Feng, Zuyi Xiong, Yue Wang, Shaoyi Du, Hanli Wang, Rui Fan

TL;DR

This work tackles unsupervised monocular depth and ego-motion learning by identifying and addressing the problem of mixing motion types in supervisory signals. It introduces DiMoDE, a framework that discriminatively treats motion components via optical axis and imaging plane alignments, enabling per-component geometric constraints and a constraint cycle that links depth, translations, and rotations. The approach yields state-of-the-art performance on multiple public datasets and a new MIAS-Odom dataset, with strong robustness under adverse conditions and compatibility with various DepthNet/PoseNet backbones. By reducing reliance on heavy back-end optimization and providing a general, geometry-informed training paradigm, DiMoDE advances robust, scalable depth perception for real-world monocular vision systems.

Abstract

Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.
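
As a rough illustration of what such closed-form relationships can look like, consider an idealized pinhole camera with focal length $f$ and image coordinates $(x, y)$ measured from the principal point. The symbols $Z$, $t_x$, $t_z$, and $(x', y')$, as well as the convention that a 3D point moves as $P' = P + t$ once rotation has been compensated, are assumptions made here for illustration and are not notation taken from the paper.

```latex
% Coplanar case: pure tangential translation t = (t_x, t_y, 0)^T
x' = \frac{f\,(X + t_x)}{Z} = x + \frac{f\,t_x}{Z}
\quad\Longrightarrow\quad
Z = \frac{f\,t_x}{x' - x},
\qquad
t_x = \frac{Z\,(x' - x)}{f}

% Coaxial case: pure radial translation t = (0, 0, t_z)^T
x' = \frac{f\,X}{Z + t_z} = \frac{x\,Z}{Z + t_z}
\quad\Longrightarrow\quad
Z = \frac{x'\,t_z}{x - x'},
\qquad
t_z = \frac{Z\,(x - x')}{x'}
```

In both cases a known flow $x' - x$ lets depth be recovered from the translation component and vice versa, which is the kind of mutual derivation the abstract describes. The analogous expressions hold for the $y$ coordinate.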

Paper Structure

This paper contains 19 sections, 31 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Tangential and radial translations result in geometrically regular but distinct depth-dependent flows. Specifically, the rigid flow induced by tangential translation varies inversely with depth, while the one resulting from radial translation is not only depth-dependent but also subject to perspective scaling. In contrast, rotation results in irregular, depth-independent flows.
  • Figure 2: The discriminative treatment of motion components and the resulting flow decomposition processes.
  • Figure 3: Ego-motion transformation decomposition: (a) Ideally, ego-motion can be decomposed into pure tangential and radial translation components; (b) In practice, errors in the PoseNet predictions introduce undesired deviations to the decomposed components.
  • Figure 4: An overview of the DiMoDE framework. Centered on the core idea of discriminative motion component treatment, depth (from DepthNet) and ego-motion (from PoseNet) are utilized to perform optical axis and imaging plane alignments. FlowNet generates dense correspondences, which are transformed during the alignment processes. The transformed flows are ultimately leveraged to incorporate two sets of geometric constraints into the unsupervised joint learning framework, thereby simultaneously improving both depth and pose estimation performance.
  • Figure 5: An illustration of our designed handheld setup equipped with calibrated and synchronized sensors for accurate real-world data collection.
  • ...and 9 more figures
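
The flow behaviours described in the Figure 1 caption can be checked numerically with a minimal sketch. It assumes an idealized pinhole camera, pixel coordinates measured from the principal point, and the convention that a 3D point moves as P' = R P + t. The helper names `rigid_flow`, `yaw`, and `IDENTITY` are hypothetical and are not taken from the DiMoDE code.

```python
import math

def rigid_flow(pixel, depth, rotation, translation, f=1.0):
    """Flow of one pixel when its 3D point moves as P' = R P + t.

    pixel is (x, y) measured from the principal point (assumed convention).
    """
    x, y = pixel
    # Back-project the pixel to a 3D point at the given depth.
    P = (x * depth / f, y * depth / f, depth)
    # Apply the rigid transform P' = R P + t.
    P2 = [sum(rotation[i][j] * P[j] for j in range(3)) + translation[i]
          for i in range(3)]
    # Re-project and subtract the original pixel position.
    return (f * P2[0] / P2[2] - x, f * P2[1] / P2[2] - y)

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def yaw(angle):
    """Rotation about the camera's y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

pixel = (0.2, 0.1)
for depth in (2.0, 4.0, 8.0):
    tangential = rigid_flow(pixel, depth, IDENTITY, (0.1, 0.0, 0.0))
    radial = rigid_flow(pixel, depth, IDENTITY, (0.0, 0.0, 0.5))
    rotational = rigid_flow(pixel, depth, yaw(0.05), (0.0, 0.0, 0.0))
    print(depth, tangential, radial, rotational)
```

In this sketch the tangential flow halves each time the depth doubles and does not depend on the pixel position. The radial flow changes with depth and also scales with the pixel's offset from the principal point, which is one way to read the perspective scaling mentioned in the caption. The rotational flow is the same at every depth.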