Web-Scale Collection of Video Data for 4D Animal Reconstruction

Brian Nlong Zhao, Jiajun Wu, Shangzhe Wu

TL;DR

The paper tackles the lack of large-scale, in-the-wild data for 4D quadruped reconstruction by proposing a fully automated pipeline that mines YouTube, creates object-centric clips with rich auxiliary annotations, and yields a 2M-frame dataset. It introduces Animal-in-Motion (AiM) as a benchmark of 230 carefully validated sequences and 11,061 frames, plus 4D-Fauna, a sequence-optimized model-free baseline that improves over prior methods. Through extensive benchmarking against model-based and model-free approaches, the work reveals a gap between 2D metrics and perceptual 3D quality, motivating 3D-aware evaluation metrics for 4D animal reconstruction. The pipeline, benchmark, and baseline collectively advance scalable, markerless 4D animal understanding from wild video data, while acknowledging limitations in 3D metric robustness and data privacy considerations.

Abstract

Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited--offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)--an order of magnitude more than prior works. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower--revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.

Paper Structure

This paper contains 64 sections, 7 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: We propose a fully automated data pipeline to collect and process data from scratch, making it ready for downstream tasks such as 4D animal reconstruction.
  • Figure 2: Overview of our proposed data pipeline. Data is automatically scraped from YouTube in stage 1, and then preprocessed into clips in stage 2. Stage 3 detects and tracks the instances, and the final stage extracts additional image features for 4D animal reconstruction.
  • Figure 3: A detailed breakdown of the number of frames collected for each animal category in the full dataset.
  • Figure 4: A detailed breakdown of the number of frames collected in benchmark data for each animal category.
  • Figure 5: Visualization of a collected data sample. Each video is annotated per frame with keypoints, masks, depth maps, occlusion boundaries, optical flow, and DINO features.
  • ...and 9 more figures
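
The four pipeline stages in the Figure 2 caption and the per-frame annotations in the Figure 5 caption can be pictured as a small data model. The following is only a minimal sketch, not the released code: the stage names, the annotation keys, `FrameRecord`, and `run_stages` are all hypothetical names chosen to paraphrase the captions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage names paraphrasing the Figure 2 caption
# (scrape -> clip preprocessing -> detection/tracking -> feature extraction).
STAGES = ("scrape", "preprocess_clips", "detect_and_track", "extract_features")

# Hypothetical keys for the per-frame annotations listed in the Figure 5 caption.
ANNOTATION_KEYS = (
    "keypoints",
    "mask",
    "depth",
    "occlusion_boundary",
    "optical_flow",
    "dino_features",
)


@dataclass
class FrameRecord:
    """One video frame plus whatever annotations have been attached so far."""

    frame_index: int
    annotations: Dict[str, object] = field(default_factory=dict)

    def missing_annotations(self) -> List[str]:
        # Report which of the expected annotation types are not present yet.
        return [key for key in ANNOTATION_KEYS if key not in self.annotations]


def run_stages(state: dict, stage_fns: Dict[str, Callable[[dict], dict]]) -> dict:
    """Apply the stage functions in the fixed order given by STAGES."""
    completed = []
    for name in STAGES:
        state = stage_fns[name](state)
        completed.append(name)
    # Return a copy of the final state that also records the executed stages.
    return {**state, "completed": completed}
```

A caller would supply one function per stage; checking `missing_annotations()` on each frame afterwards is one simple way to confirm that a clip is ready for a downstream task such as 4D reconstruction.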