CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation

Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi

TL;DR

This work addresses the under-constrained problem of recovering 3D horse pose and shape from monocular video by integrating synchronized audio signals. The authors introduce two audio–video fusion strategies (Early fusion and Model fusion) built on the hSMAL horse model and a vision-based backbone, enabling improved 3D reconstruction and robustness to appearance changes and self-occlusion. They validate on a treadmill dataset and a newly released Outdoor Dataset, showing that audio information enhances pose estimation, especially under challenging visual conditions, including synthetic occlusions. The work contributes a new multimodal approach to animal motion capture and provides practical datasets and analysis demonstrating the benefits and limits of audio-augmented 3D animal motion recovery, with potential extensions to other species and settings.

Abstract

In the monocular setting, predicting 3D pose and shape of animals typically relies solely on visual information, which is highly under-constrained. In this work, we explore using audio to enhance 3D shape and motion recovery of horses from monocular video. We test our approach on two datasets: an indoor treadmill dataset for 3D evaluation and an outdoor dataset capturing diverse horse movements, the latter being a contribution to this study. Our results show that incorporating sound with visual data leads to more accurate and robust motion regression. This study is the first to investigate audio's role in 3D animal motion recovery.

Paper Structure

This paper contains 25 sections, 6 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: We estimate the articulated 3D motion of a horse from video, combining both visual and auditory information. We show that by training with both of these modalities, we are able to reconstruct poses that are more accurate and natural, even under self-occlusion. In the figure, we show results for an Image-only network (red) and for two networks that exploit audio: Early-fusion (green) and Model-fusion (blue). Note that both audio-based networks reconstruct a more natural head pose and correctly estimate that the front left hoof touches the ground.
  • Figure 2: Video–audio fusion frameworks: (a) Early-fusion, (b) Model-fusion. Both networks use the same architecture for feature extraction and for predicting $C_{1:T}$, $\beta$, and $\theta_{1:T}^{Global}$ from video features. The key difference lies in estimating $\theta_{1:T}^{Joints}$: (a) combines video and audio features before $\Phi$ to estimate $\theta_{1:T}^{Joints}$, whereas (b) passes the video and audio features separately through the shared $\Phi$ to obtain $\theta_{1:T}^{I_{Joints}}$ and $\theta_{1:T}^{A_{Joints}}$. All parameters are then mapped to 3D meshes $\mathbf{v_{1:T}}$ for 2D projection and loss calculation. Inputs are shown in orange and learnable modules in green.
  • Figure 3: Example results of different networks in the Treadmill Dataset with Test Data 1 (a) and Test Data 2 (b). (i) The model is shown in different views. (ii) Model overlapped with the original images. Refer to the main text for more details.
  • Figure 4: Samples from the Outdoor Dataset with the full body visible. Image-only Network (in LightRed), Early-fusion Network (in LightGreen), Model-fusion Network (in LightBlue).
  • Figure 5: Sample outputs on the Outdoor Dataset. Image-only Network (in LightRed), Early-fusion Network (in LightGreen), Model-fusion Network (in LightBlue).
  • ...and 4 more figures
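
The difference between the two fusion strategies described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the feature sizes, the random linear maps standing in for the learned module $\Phi$, and the `audio_proj` layer (assumed here so that audio features match the video feature size before the shared $\Phi$) are all hypothetical choices made for illustration.

```python
import random

# Hypothetical sizes: frames, video feature dim, audio feature dim, joint-pose dim.
T, D_V, D_A, N_POSE = 4, 6, 3, 5


def linear(in_dim, out_dim, rng):
    """Return a fixed random linear map, standing in for a learned module."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        assert len(x) == in_dim
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    return apply


def early_fusion(video_feats, audio_feats, phi):
    """Concatenate per-frame video and audio features, then apply phi once."""
    return [phi(v + a) for v, a in zip(video_feats, audio_feats)]


def model_fusion(video_feats, audio_feats, phi, audio_proj):
    """Apply the same phi to each modality separately (two pose estimates)."""
    theta_img = [phi(v) for v in video_feats]
    # Assumption: audio is projected to the video feature size before phi.
    theta_aud = [phi(audio_proj(a)) for a in audio_feats]
    return theta_img, theta_aud


rng = random.Random(0)
video = [[rng.random() for _ in range(D_V)] for _ in range(T)]
audio = [[rng.random() for _ in range(D_A)] for _ in range(T)]

# Each network has its own phi; only the input it receives differs.
phi_early = linear(D_V + D_A, N_POSE, rng)
phi_shared = linear(D_V, N_POSE, rng)
audio_proj = linear(D_A, D_V, rng)

theta_early = early_fusion(video, audio, phi_early)
theta_img, theta_aud = model_fusion(video, audio, phi_shared, audio_proj)
```

In this sketch, Early-fusion yields a single joint-pose sequence per clip, while Model-fusion yields one sequence from video and one from audio, mirroring $\theta_{1:T}^{I_{Joints}}$ and $\theta_{1:T}^{A_{Joints}}$ in the caption.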