Web-Scale Collection of Video Data for 4D Animal Reconstruction
Brian Nlong Zhao, Jiajun Wu, Shangzhe Wu
TL;DR
The paper tackles the lack of large-scale, in-the-wild data for 4D quadruped reconstruction by proposing a fully automated pipeline that mines YouTube, creates object-centric clips with rich auxiliary annotations, and yields a 2M-frame dataset. It introduces Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences (11,061 frames), and 4D-Fauna, a model-free baseline with sequence-level optimization that improves over prior methods. Benchmarking model-based and model-free approaches on AiM reveals a gap between 2D metrics and perceptual 3D quality, which motivates 3D-aware evaluation metrics for 4D animal reconstruction. Together, the pipeline, benchmark, and baseline advance scalable, markerless 4D animal understanding from in-the-wild video, while the work acknowledges limitations in 3D metric robustness and data privacy.
Abstract
Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited: they offer as few as 2.4K 15-frame clips and lack key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks such as pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames), an order of magnitude more than prior work. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite their unrealistic 3D shapes, while the latter yield more natural reconstructions but score lower, revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.
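
The evaluation gap noted in the abstract can be illustrated with a toy example: a 2D metric computed on projected silhouettes cannot distinguish two shapes that differ only along the viewing direction. The sketch below is illustrative only. The choice of mask IoU as the 2D metric, the orthographic projection along z, and all function names are assumptions made for this example, not the metrics or camera model used in the paper.

```python
# Toy illustration (assumptions: mask IoU as the 2D metric, orthographic
# projection along z). Not the paper's evaluation protocol.

def silhouette(points, grid=8):
    """Project 3D points along z onto a grid x grid binary mask (set of cells)."""
    mask = set()
    for x, y, _z in points:
        col, row = int(x), int(y)
        if 0 <= col < grid and 0 <= row < grid:
            mask.add((row, col))
    return mask

def mask_iou(a, b):
    """Intersection-over-union of two masks given as sets of (row, col) cells."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def depth_extent(points):
    """Extent of the shape along the viewing (z) axis."""
    zs = [z for _, _, z in points]
    return max(zs) - min(zs)

# A shape with real depth: a 4 x 3 footprint, three cells deep.
plausible = [(x, y, z) for x in range(2, 6) for y in range(2, 5) for z in range(3)]
# A degenerate shape: same footprint, completely flat along z.
flattened = [(x, y, 0) for x in range(2, 6) for y in range(2, 5)]

# Treat the silhouette of the plausible shape as the ground-truth mask.
target = silhouette(plausible)
iou_plausible = mask_iou(silhouette(plausible), target)
iou_flattened = mask_iou(silhouette(flattened), target)

print(iou_plausible, iou_flattened)                      # identical 2D scores
print(depth_extent(plausible), depth_extent(flattened))  # different 3D shapes
```

Both shapes receive the same 2D score even though one of them has no depth at all, which is the kind of failure a 3D-aware metric would need to catch.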
