An uncertainty-aware framework for data-efficient multi-view animal pose estimation
Lenny Aharon, Keemin Lee, Karan Sikka, Selmaan Chettih, Cole Hurwitz, Liam Paninski, Matthew R Whiteway
TL;DR
The paper tackles data-efficient multi-view animal pose estimation with reliable uncertainty estimates. Its uncertainty-aware framework has three components: an early-fusion multi-view transformer (MVT) trained with cross-view patch masking and 3D losses; a nonlinear Ensemble Kalman Smoother (mvEKS) with variance inflation for robust post-processing; and a distillation pipeline that uses high-quality EKS pseudo-labels to transfer ensemble knowledge into a single efficient model. Experiments on flies, mice, and chickadees show that cross-view attention, geometric constraints, and calibrated uncertainty improve keypoint accuracy and reliability when labels are limited. The framework adapts to setups with or without camera calibration and to varying amounts of labeled data, making uncertainty-aware pose estimation practical for behavioral analyses in laboratory and field-like settings.
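To make the idea of a geometric (3D) constraint concrete, here is a minimal sketch of one common form of triangulation-based consistency loss, assuming calibrated cameras with known 3x4 projection matrices: triangulate a keypoint from all views with the direct linear transform (DLT), reproject it into each view, and measure the mean reprojection error. The function names (`triangulate_dlt`, `project`, `triangulation_loss`) and the toy two-camera rig are hypothetical, and the paper's actual 3D losses may be defined differently. The sketch uses NumPy rather than a differentiable framework, so it illustrates the quantity being penalized rather than a training-ready loss.

```python
import numpy as np


def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from V calibrated views with the DLT.

    projections: (V, 3, 4) camera projection matrices (assumed known).
    points_2d:   (V, 2) pixel coordinates (u, v) of one keypoint per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The least-squares homogeneous solution is the right singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(projections, point_3d):
    """Project a 3D point into every view; returns (V, 2) pixel coordinates."""
    x = projections @ np.append(point_3d, 1.0)  # (V, 3) homogeneous image points
    return x[:, :2] / x[:, 2:3]


def triangulation_loss(projections, points_2d):
    """Mean reprojection error (pixels) after triangulating the 2D predictions."""
    point_3d = triangulate_dlt(projections, points_2d)
    residuals = project(projections, point_3d) - points_2d
    return float(np.mean(np.linalg.norm(residuals, axis=1)))


# Hypothetical toy rig: two cameras with shared intrinsics, the second shifted along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
cams = np.stack([P1, P2])

true_point = np.array([0.2, -0.1, 5.0])
clean_2d = project(cams, true_point)  # geometrically consistent 2D predictions
noisy_2d = clean_2d.copy()
noisy_2d[0, 1] += 5.0                 # shift view 0 vertically by 5 px

print(triangulation_loss(cams, clean_2d), triangulation_loss(cams, noisy_2d))
```

Because the two toy cameras differ only by a horizontal translation, a keypoint's match in the other view must lie on the same image row. Shifting one view vertically violates this, so no single 3D point explains both detections and the loss rises to a few pixels. A purely horizontal shift would instead change the triangulated depth and leave the loss near zero, which is why a loss of this kind enforces cross-view consistency rather than absolute accuracy.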
Abstract
Multi-view pose estimation is essential for quantifying animal behavior in scientific research, yet current methods struggle to achieve accurate tracking with limited labeled data and suffer from poor uncertainty estimates. We address these challenges with a comprehensive framework combining novel training and post-processing techniques, and a model distillation procedure that leverages the strengths of these techniques to produce a more efficient and effective pose estimator. Our multi-view transformer (MVT) utilizes pretrained backbones and enables simultaneous processing of information across all views, while a novel patch masking scheme learns robust cross-view correspondences without camera calibration. For calibrated setups, we incorporate geometric consistency through 3D augmentation and a triangulation loss. We extend the existing Ensemble Kalman Smoother (EKS) post-processor to the nonlinear case and enhance uncertainty quantification via a variance inflation technique. Finally, to leverage the scaling properties of the MVT, we design a distillation procedure that exploits improved EKS predictions and uncertainty estimates to generate high-quality pseudo-labels, thereby reducing dependence on manual labels. Our framework components consistently outperform existing methods across three diverse animal species (flies, mice, chickadees), with each component contributing complementary benefits. The result is a practical, uncertainty-aware system for reliable pose estimation that enables downstream behavioral analyses under real-world data constraints.
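The core idea behind ensemble Kalman smoothing can be illustrated in its simplest form. The sketch below is a linear, single-coordinate version, assuming a random-walk motion model. The mean of an ensemble of model predictions serves as the observation, the ensemble variance (scaled by a factor) serves as the observation noise, and a Kalman filter followed by a Rauch-Tung-Striebel backward pass produces smoothed positions and posterior variances. It does not implement the paper's nonlinear multi-view extension. The function name, the `process_var` default, and the constant `inflation` multiplier are hypothetical; the multiplier is only a stand-in for the paper's variance inflation technique, whose details the abstract does not give.

```python
import numpy as np


def ensemble_kalman_smoother(ensemble, process_var=0.01, inflation=1.0):
    """Smooth one keypoint coordinate from an (M models, T frames) ensemble.

    Observation: per-frame ensemble mean.
    Observation noise: per-frame ensemble variance times `inflation`
    (a hypothetical constant stand-in for variance inflation).
    Dynamics: random walk x[t] = x[t-1] + noise with variance `process_var`.
    """
    obs = ensemble.mean(axis=0)
    obs_var = inflation * ensemble.var(axis=0) + 1e-8  # keep noise strictly positive
    T = obs.shape[0]
    m_pred, v_pred = np.zeros(T), np.zeros(T)
    m_filt, v_filt = np.zeros(T), np.zeros(T)

    # Forward pass: Kalman filter.
    for t in range(T):
        if t == 0:
            m_pred[t], v_pred[t] = obs[0], 1e6  # diffuse prior on the first frame
        else:
            m_pred[t], v_pred[t] = m_filt[t - 1], v_filt[t - 1] + process_var
        gain = v_pred[t] / (v_pred[t] + obs_var[t])
        m_filt[t] = m_pred[t] + gain * (obs[t] - m_pred[t])
        v_filt[t] = (1.0 - gain) * v_pred[t]

    # Backward pass: Rauch-Tung-Striebel smoother.
    m_smooth, v_smooth = m_filt.copy(), v_filt.copy()
    for t in range(T - 2, -1, -1):
        J = v_filt[t] / v_pred[t + 1]
        m_smooth[t] = m_filt[t] + J * (m_smooth[t + 1] - m_pred[t + 1])
        v_smooth[t] = v_filt[t] + J**2 * (v_smooth[t + 1] - v_pred[t + 1])
    return m_smooth, v_smooth


# Toy example: 5 noisy "models" tracking a sine wave over 200 frames.
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
ensemble = truth + rng.normal(scale=1.0, size=(5, 200))

m_s, v_s = ensemble_kalman_smoother(ensemble)
_, v_s_inflated = ensemble_kalman_smoother(ensemble, inflation=4.0)

raw_mse = np.mean((ensemble.mean(axis=0) - truth) ** 2)
smooth_mse = np.mean((m_s - truth) ** 2)
print(f"ensemble-mean MSE {raw_mse:.3f}, smoothed MSE {smooth_mse:.3f}")
```

Inflating the observation variance makes the smoother trust the ensemble mean less, which raises the posterior variance at every frame. In this simplified setting, that is the basic mechanism by which inflation can counteract an overconfident ensemble.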
