Benchmark on Monocular Metric Depth Estimation in Wildlife Setting
Niccolò Niccoli, Lorenzo Seidenari, Ilaria Greco, Francesco Rovero
TL;DR
This work addresses the challenge of estimating metric depth from monocular wildlife imagery, where a single image carries no direct depth measurement and scale varies across species and environments. It presents the first wildlife-focused benchmark for monocular metric depth estimation (MDE), built on a custom dataset of 93 outdoor camera-trap images with ground-truth distances obtained from calibrated ChARUCO patterns and a protocol tailored to wildlife conditions. Four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D) and a geometric baseline are evaluated, and two depth extraction strategies (median vs. mean) are compared to assess robustness. Depth Anything V2 achieves the best overall performance (MAE = 0.454 m, r = 0.962), while ZoeDepth degrades significantly in outdoor natural settings (MAE = 3.087 m). Median-based depth extraction consistently outperforms mean-based extraction across all deep learning methods. ZoeDepth is the fastest method (0.17 s per image) but the least accurate, whereas Depth Anything V2 (0.22 s per image) offers the best balance of accuracy and speed. The benchmark provides practical baselines and guidance for deploying depth estimation in wildlife monitoring, and highlights the need for wildlife-specific datasets and domain adaptation to improve real-world utility.
Abstract
Camera traps are widely used for wildlife monitoring, but extracting accurate distance measurements from monocular images remains challenging due to the lack of depth information. While monocular depth estimation (MDE) methods have advanced significantly, their performance in natural wildlife environments has not been systematically evaluated. This work introduces the first benchmark for monocular metric depth estimation in wildlife monitoring conditions. We evaluate four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D) alongside a geometric baseline on 93 camera-trap images with ground-truth distances obtained using calibrated ChARUCO patterns. Our results demonstrate that Depth Anything V2 achieves the best overall performance with a mean absolute error of 0.454 m and correlation of 0.962, while methods like ZoeDepth show significant degradation in outdoor natural environments (MAE: 3.087 m). We find that median-based depth extraction consistently outperforms mean-based approaches across all deep learning methods. Additionally, we analyze computational efficiency, with ZoeDepth being fastest (0.17 s per image) but least accurate, while Depth Anything V2 provides an optimal balance of accuracy and speed (0.22 s per image). This benchmark establishes performance baselines for wildlife applications and provides practical guidance for implementing depth estimation in conservation monitoring systems.
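
The difference between median-based and mean-based depth extraction, and the two reported metrics (mean absolute error and correlation), can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rectangular target region, the box convention, and the toy depth map are all assumptions made for illustration, and the correlation is assumed to be Pearson's r.

```python
from math import sqrt
from statistics import mean, median


def extract_depth(depth_map, box, strategy="median"):
    """Aggregate predicted depths (metres) inside a rectangular region.

    depth_map: list of rows, each a list of per-pixel depths in metres.
    box: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive
         (a hypothetical convention chosen for this sketch).
    """
    x0, y0, x1, y1 = box
    values = [d for row in depth_map[y0:y1] for d in row[x0:x1]]
    if not values:
        raise ValueError("box selects no pixels")
    return median(values) if strategy == "median" else mean(values)


def mean_absolute_error(predicted, truth):
    """Average absolute difference between predicted and true distances."""
    return mean(abs(p - t) for p, t in zip(predicted, truth))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy depth map: the target sits at about 5 m, but two background
# pixels at 20 m leak into the region around it.
toy_map = [
    [5.0, 5.1, 20.0, 9.0],
    [4.9, 5.0, 5.2, 9.0],
    [5.0, 20.0, 5.1, 9.0],
]
toy_box = (0, 0, 3, 3)

median_depth = extract_depth(toy_map, toy_box, "median")  # stays near 5 m
mean_depth = extract_depth(toy_map, toy_box, "mean")      # pulled toward 20 m
```

In this toy case the median stays close to the target distance while the mean is dragged upward by the two background pixels, which is one plausible reason a median would be more robust when the region around a target includes background.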
