Depth as Prior Knowledge for Object Detection

Moussa Kassem Sbeyti, Nadja Klein

TL;DR

DepthPrior treats depth as prior knowledge rather than as a fused input feature, targeting the systematic degradation in small and distant object detection caused by depth-induced heteroscedasticity. The authors formalize the depth-detection relationship, then introduce three modular components that operate during training and inference without modifying detector architectures: Depth-Based Loss Weighting (DLW), Depth-Based Loss Stratification (DLS), and Depth-Aware Confidence Thresholding (DCT). Across four diverse benchmarks and two detector families, DepthPrior yields consistent gains, notably up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, and its depth-aware post-processing adds no overhead beyond the initial depth estimation. The results indicate that depth-informed supervision, even from monocular depth estimates, can improve recall and precision on distant objects, offering a practical, plug-and-play option for safety-critical perception tasks.

Abstract

Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at https://github.com/mos-ks/DepthPrior.

Paper Structure

This paper contains 65 sections, 4 theorems, 26 equations, 20 figures, 19 tables, 2 algorithms.

Key Result

Proposition 3.1

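Under the intensity-distance and variance-signal assumptions (labels assumption:intensity_distance and assumption:variance_signal), the conditional variance of the detection loss is given by an expression (the displayed equation is missing from this extract) in which $\sigma_0^2 = \alpha^2/\kappa$.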

Figures (20)

  • Figure 1: DepthPrior framework (notation simplified for clarity). Top (DLW): non-linear weighting $w_i = 1 + \alpha \cdot \exp(d_{i,\text{norm}})$ prioritizes distant objects. Middle (DLS): binary masks decompose loss into close/distant components with weights $\lambda_{\text{close}}, \lambda_{\text{distant}}$. Bottom (DCT): learned splines $\tau(d_{i,\text{norm}})$ adjust thresholds during inference. The framework requires only monocular depth estimation and operates during both training and inference without architectural modifications.
  • Figure 2: Depth distribution of all GT objects (blue) vs. MD (orange) for EfficientDet on validation data. Object counts shown in parentheses.
  • Figure 3: Depth-dependent error distributions (MD, red and ED, gray) at confidence thresholds $\tau_0 = 0.4$ (left) and $\tau_0 = 0.9$ (right) for EfficientDet on the validation set.
  • Figure 4: Match rate heatmaps for YOLOv11 on the validation set. Color indicates fraction of detections matching GT.
  • Figure 5: Static (blue) vs. DCT-recovered detections (orange) on inference data. Left: KITTI with EfficientDet. Right: VisDrone with YOLOv11.
  • ...and 15 more figures
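
The three components named in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. Only the weighting formula $w_i = 1 + \alpha \cdot \exp(d_{i,\text{norm}})$ comes from the caption. The function names, the depth split point, the per-stratum mean, the default $\lambda$ values, and the use of piecewise-linear interpolation in place of the learned splines are all hypothetical choices made for this example.

```python
import numpy as np


def dlw_weights(depth_norm, alpha=1.0):
    """Depth-Based Loss Weighting: w_i = 1 + alpha * exp(d_i_norm) (Figure 1 caption)."""
    depth_norm = np.asarray(depth_norm, dtype=float)
    return 1.0 + alpha * np.exp(depth_norm)


def dls_loss(per_object_loss, depth_norm, split=0.5, lam_close=1.0, lam_distant=2.0):
    """Depth-Based Loss Stratification with binary close/distant masks.

    Assumptions for this sketch: objects with normalized depth >= `split`
    count as distant, and each stratum contributes its mean loss.
    """
    loss = np.asarray(per_object_loss, dtype=float)
    depth_norm = np.asarray(depth_norm, dtype=float)
    distant = depth_norm >= split  # binary mask for distant objects
    close = ~distant               # complementary mask for close objects

    def masked_mean(values, mask):
        # An empty stratum contributes zero instead of NaN.
        return values[mask].mean() if mask.any() else 0.0

    return lam_close * masked_mean(loss, close) + lam_distant * masked_mean(loss, distant)


def dct_keep(scores, depth_norm, knots_depth, knots_tau):
    """Depth-Aware Confidence Thresholding: keep detections with score >= tau(d).

    Assumption: np.interp (piecewise-linear) stands in for the learned
    spline tau(d_norm); the knots here are hypothetical, not fitted values.
    """
    tau = np.interp(np.asarray(depth_norm, dtype=float), knots_depth, knots_tau)
    return np.asarray(scores, dtype=float) >= tau


if __name__ == "__main__":
    depths = np.array([0.1, 0.9])        # one close and one distant object
    losses = np.array([0.5, 1.5])        # per-object detection losses
    weighted = (dlw_weights(depths, alpha=0.5) * losses).mean()
    stratified = dls_loss(losses, depths)
    kept = dct_keep([0.35, 0.35], depths, [0.0, 1.0], [0.4, 0.3])
    print(weighted, stratified, kept)
```

In this sketch the distant object receives a larger weight under DLW and a larger $\lambda$ under DLS, while DCT lowers the confidence threshold as depth grows, so a distant detection with a score of 0.35 is kept where a close one with the same score is dropped.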

Theorems & Definitions (15)

  • Definition 3.1: Signal Quality Function
  • Proposition 3.1: Depth-Induced Heteroscedasticity
  • Corollary 3.2: Bias Toward Nearby Objects
  • Remark 3.1
  • Definition 3.2: Variance-Compensating Weights
  • Remark 3.2
  • Definition 3.3: Depth-Based Loss Stratification
  • Proposition 3.3: Gradient of $\mathcal{L}_{\text{strat}}$
  • Remark 3.3
  • Remark 3.4
  • ...and 5 more