CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation
Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi
TL;DR
This work addresses the under-constrained problem of recovering 3D horse pose and shape from monocular video by integrating synchronized audio signals. The authors introduce two audio–video fusion strategies (Early fusion and Model fusion) built on the hSMAL horse model and a vision-based backbone, improving 3D reconstruction and robustness to appearance changes and self-occlusion. They validate on an indoor treadmill dataset and a newly released outdoor dataset, showing that audio improves pose estimation, especially under challenging visual conditions such as synthetic occlusions. The work contributes a new multimodal approach to animal motion capture, along with datasets and analysis that show both the benefits and the limits of audio-augmented 3D animal motion recovery. The approach could potentially extend to other species and settings.
Abstract
In the monocular setting, predicting 3D pose and shape of animals typically relies solely on visual information, which is highly under-constrained. In this work, we explore using audio to enhance 3D shape and motion recovery of horses from monocular video. We test our approach on two datasets: an indoor treadmill dataset for 3D evaluation and an outdoor dataset capturing diverse horse movements, the latter being a contribution of this study. Our results show that incorporating sound with visual data leads to more accurate and robust motion regression. This study is the first to investigate audio's role in 3D animal motion recovery.
