DMODE: Differential Monocular Object Distance Estimation Module without Class Specific Information
Pedram Agand, Michael Chang, Mo Chen
TL;DR
DMODE estimates monocular object distance without object class information by fusing temporal changes in object size with camera ego-motion. It combines a detector-agnostic framework with a three-frame input sequence, ResNet-18-derived latent features, and dual heads that predict Cartesian coordinates $\phi=(x,y,z)$ and distance $d$ under a BerHu training objective. The theoretical analysis generalizes distance estimation to 3D for $q+1$ frames and covers special cases (e.g., constant velocity), while the network architecture enforces consistency with the analytic distance relations and remains robust across detection sources (ground-truth boxes, TrackRCNN, EagerMOT) and unseen classes. Empirically, DMODE achieves competitive or superior performance on KITTI MOTS in multi-class scenarios, enabling transferable, low-cost 3D perception for autonomous driving without class-specific cues or intrinsic camera calibration. The approach is promising for deployments where detector variability and scale-ambiguous monocular cues make class-aware methods impractical: it uses size dynamics and ego-motion to infer 3D position, with distance $d=\sqrt{x^2+y^2+z^2}$, under minimal supervision.
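To make the two quantities named above concrete, the following is a minimal sketch of the distance relation $d=\sqrt{x^2+y^2+z^2}$ and of a BerHu (reverse Huber) loss. The function names and the threshold rule (`c` set to 20% of the largest absolute residual in the batch, a common convention for BerHu) are assumptions for illustration; the summary does not state how DMODE sets the threshold or how it weights the two heads.

```python
import math


def distance_from_xyz(x, y, z):
    """Euclidean distance d = sqrt(x^2 + y^2 + z^2) of a point from the camera origin."""
    return math.sqrt(x * x + y * y + z * z)


def berhu_loss(pred, target, c_ratio=0.2):
    """Mean BerHu (reverse Huber) loss over paired predictions and targets.

    Residuals up to the threshold c are penalized linearly (L1);
    larger residuals are penalized quadratically as (r^2 + c^2) / (2c).
    ASSUMPTION: c = c_ratio * max residual in the batch; the threshold
    used by DMODE is not given in the summary.
    """
    residuals = [abs(p - t) for p, t in zip(pred, target)]
    if not residuals:
        return 0.0
    c = c_ratio * max(residuals)
    if c == 0.0:
        # All residuals are zero, so the loss is zero.
        return 0.0
    total = 0.0
    for r in residuals:
        if r <= c:
            total += r
        else:
            total += (r * r + c * c) / (2.0 * c)
    return total / len(residuals)
```

For example, `distance_from_xyz(3.0, 4.0, 12.0)` evaluates the relation for a point 3 m to the side, 4 m up, and 12 m ahead. The two branches of the loss agree at `r == c`, so the penalty is continuous across the threshold.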
Abstract
Measuring object distances with a single camera is a cost-effective alternative to stereo vision and LiDAR. Although monocular distance estimation has been explored in the literature, most existing techniques rely on object class knowledge to achieve high performance. Without this contextual information, monocular distance estimation is more challenging because reference points and object-specific cues are missing. However, class-specific cues can be misleading for objects with wide-ranging variation or in adversarial situations, and estimating distance without them is itself a challenging aspect of object-agnostic distance estimation. In this paper, we propose DMODE, a class-agnostic method for monocular distance estimation. DMODE estimates an object's distance by fusing the change in its size over time with the camera's motion, which makes it adaptable to various object detectors and to unknown objects. We evaluate our model on the KITTI MOTS dataset using ground-truth bounding box annotations and outputs from TrackRCNN and EagerMOT. The object's location is determined from the change in bounding box sizes and the camera position, without using the detection source or the object's class attributes. Our approach demonstrates superior performance in multi-class object distance estimation scenarios compared to conventional methods.
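The intuition behind fusing size change with camera motion can be illustrated with a deliberately simplified two-frame case. This is a sketch under strong assumptions that are not taken from the paper: a pinhole camera, a static object, and camera motion purely along the line of sight. Under these assumptions the apparent size is inversely proportional to distance, so $s_{prev} d_{prev} = s_{curr} d_{curr}$ with $d_{curr} = d_{prev} - \Delta$, which gives $d_{curr} = s_{prev}\Delta / (s_{curr} - s_{prev})$. DMODE itself uses three frames and a learned network; the function name `distance_from_scale_change` is hypothetical.

```python
def distance_from_scale_change(size_prev, size_curr, forward_motion):
    """Current distance to a static object from its apparent-size change.

    ASSUMPTIONS (illustrative, not the paper's method): pinhole camera,
    static object, and the camera moved `forward_motion` meters straight
    toward the object between the two frames.

    size_prev, size_curr: bounding box height (or width) in pixels.
    Neither the focal length nor the true object size is needed,
    because both cancel in the size ratio.
    """
    growth = size_curr - size_prev
    if growth == 0:
        raise ValueError("no size change: distance is unobservable in this model")
    return size_prev * forward_motion / growth


# A 2 m tall object seen with a 1000 px focal length appears 100 px tall
# at 20 m and 125 px tall at 16 m, after the camera advances 4 m.
estimated = distance_from_scale_change(100.0, 125.0, 4.0)
```

The example also shows why class knowledge is unnecessary in this idealized setting: the object's physical size never enters the computation, only its relative change in the image and the camera displacement.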
