An uncertainty-aware framework for data-efficient multi-view animal pose estimation

Lenny Aharon, Keemin Lee, Karan Sikka, Selmaan Chettih, Cole Hurwitz, Liam Paninski, Matthew R Whiteway

TL;DR

The paper addresses data-efficient, multi-view animal pose estimation with reliable uncertainty by introducing an uncertainty-aware framework that combines an early-fusion multi-view transformer (MVT) with patch masking and 3D losses, a variance-inflated nonlinear Ensemble Kalman Smoother (mvEKS) for robust post-processing, and a distillation pipeline that transfers ensemble knowledge into a single efficient model. The approach demonstrates that cross-view attention, geometric constraints, and calibrated uncertainty significantly improve keypoint accuracy and reliability across flies, mice, and chickadees under limited-label regimes. Key contributions include the MVT with cross-view patch masking, the nonlinear mvEKS with variance inflation, and a distillation workflow using high-quality EKS pseudo-labels to achieve strong single-model performance. Collectively, the framework enables practical, uncertainty-aware pose estimation suitable for real-world behavioral analyses in laboratory and field-like settings, with broad adaptability to calibration availability and data constraints.

Abstract

Multi-view pose estimation is essential for quantifying animal behavior in scientific research, yet current methods struggle to achieve accurate tracking with limited labeled data and suffer from poor uncertainty estimates. We address these challenges with a comprehensive framework combining novel training and post-processing techniques, and a model distillation procedure that leverages the strengths of these techniques to produce a more efficient and effective pose estimator. Our multi-view transformer (MVT) utilizes pretrained backbones and enables simultaneous processing of information across all views, while a novel patch masking scheme learns robust cross-view correspondences without camera calibration. For calibrated setups, we incorporate geometric consistency through 3D augmentation and a triangulation loss. We extend the existing Ensemble Kalman Smoother (EKS) post-processor to the nonlinear case and enhance uncertainty quantification via a variance inflation technique. Finally, to leverage the scaling properties of the MVT, we design a distillation procedure that exploits improved EKS predictions and uncertainty estimates to generate high-quality pseudo-labels, thereby reducing dependence on manual labels. Our framework components consistently outperform existing methods across three diverse animal species (flies, mice, chickadees), with each component contributing complementary benefits. The result is a practical, uncertainty-aware system for reliable pose estimation that enables downstream behavioral analyses under real-world data constraints.
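The patch masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_patches`, the fixed `mask_ratio`, and the choice to sample masked positions uniformly and jointly over all views are assumptions made for illustration. The point it shows is that a patch hidden in one view can remain visible in another, so a model attending across views has an incentive to use cross-view information.

```python
import random


def mask_patches(num_views, patches_per_view, mask_ratio, seed=None):
    """Return one boolean list per view; True means the patch is masked.

    Hypothetical helper: uniform joint sampling with a fixed ratio is an
    assumption, not the masking schedule used in the paper.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    total = num_views * patches_per_view
    num_masked = round(mask_ratio * total)
    # Sample masked positions over the concatenated patch sequence of all
    # views, so the number of masked patches can differ from view to view.
    masked = set(rng.sample(range(total), num_masked))
    return [
        [(v * patches_per_view + p) in masked for p in range(patches_per_view)]
        for v in range(num_views)
    ]


# Example: 3 views, a 4x4 patch grid per view, one quarter of patches masked.
masks = mask_patches(num_views=3, patches_per_view=16, mask_ratio=0.25, seed=0)
for view_index, view_mask in enumerate(masks):
    print(view_index, sum(view_mask), "patches masked")
```

In a real pipeline the boolean mask would be applied to the pixel patches before patch embedding, as the Figure 1 caption describes; the sketch only covers how the mask itself might be drawn.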

Paper Structure

This paper contains 48 sections, 17 equations, 17 figures.

Figures (17)

  • Figure 1: Multi-view transformer with patch masking and 3D loss. Top: Single-view transformer architecture. Input frames are split into patches, embedded into a latent space, combined with a fixed position encoding, and processed through a Vision Transformer (ViT). Outputs are reshaped and passed to a heatmap head. The model is trained with a mean square error (MSE) loss between predicted and ground truth heatmaps. Multiple views are processed independently. Bottom: Multi-view transformer architecture. Pixel patches are randomly masked before patch embedding, then added to fixed positional and learnable view encodings. A single ViT processes all views simultaneously. The model also produces predicted 3D keypoints using 2D heatmaps and camera calibration, which are compared against ground truth 3D keypoints with an additional MSE loss.
  • Figure 2: Multi-view transformer with patch masking and 3D loss improves pose estimation. a: Top: Experimental setup and labeled keypoints. Bottom: Example frames for a single instance. b: Example traces from single-view transformer (SVT; teal), and the multi-view transformer with patch masking and 3D loss (MVT++; purple; except for Treadmill Mouse, which lacks camera parameters). Bottom panels show 3D reprojection error (using 3D PCA for Treadmill Mouse), indicating more consistent predictions across views for MVT++. c: Pixel error as a function of keypoint difficulty (lower is better). Dashed vertical lines indicate the percentage of data used for the pixel error computation. Fly diagram from Karashchuk et al. (2021), Anipose.
  • Figure 3: Multi-view Ensemble Kalman Smoother (mvEKS) improves pose estimation. a: Keypoints are modeled as projections from a 3D latent that evolves smoothly over time. Low-uncertainty observations from reliable camera views help correct high-uncertainty observations through spatial and temporal constraints. b: Traces of a mouse paw from two camera views. The optimal smoothing parameter (green) recovers the true oscillatory motion in the partly occluded Top view, while oversmoothing (purple) distorts the temporal dynamics. c: Multi-view observation where predictions are consistent across views, requiring no variance inflation. d: Inconsistent predictions across views detected by variance inflation, where the more confident predictions are correct. Orange crosses are ensemble median with ensemble variance; green crosses are corrected predictions from mvEKS with posterior predictive variance. e: Inconsistent predictions between views where a highly confident but incorrect prediction in the top view dominates; mvEKS is unable to override the confident error, but the variance inflation procedure adjusts the posterior predictive variance to reflect the remaining uncertainty. f: The ensemble median (orange) outperforms individual MVT++ models (purple); nonlinear variance-inflated mvEKS (light green) achieves the best performance. Treadmill mouse (uncalibrated setup) uses linear mvEKS.
  • Figure 4: Pseudo-label-based distillation of EKS improves pose estimation. a: Schematic of our distillation procedure. b: The distilled MVT+EKS model (orange) outperforms initial ensemble member MVT++ models (purple), but does not reach the performance of EKS (green). Enforcing geometric consistency on the distilled model output (pink) brings single-model performance to the level of the full MVT+EKS pipeline (green). For calibrated setups, we also compare against the state-of-the-art ResNet-50+Anipose baseline (gray), which performs comparably to our single-network distilled model without any post-processing.
  • Figure 5: Comparison of pretrained transformer and ResNet-50 backbones. ViT/B is a "base" model ($\sim$80M parameters), ViT/S is a "small" model ($\sim$20M parameters); ResNet-50 has $\sim$20M parameters.
  • ...and 12 more figures
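
The Figure 3 caption refers to the ensemble median and ensemble variance of a keypoint, which serve as the observations that mvEKS then smooths. The sketch below shows one plausible way to compute those two summaries for a single keypoint in a single view and frame. The function name `ensemble_summary` and the use of the population variance (rather than the sample variance) are assumptions for illustration, not details taken from the paper.

```python
from statistics import median, pvariance


def ensemble_summary(predictions):
    """Summarize ensemble predictions for one keypoint in one view and frame.

    predictions: list of (x, y) pixel coordinates, one per ensemble member.
    Returns ((median_x, median_y), (var_x, var_y)).
    Using the population variance here is an assumption of this sketch.
    """
    xs = [p[0] for p in predictions]
    ys = [p[1] for p in predictions]
    return (median(xs), median(ys)), (pvariance(xs), pvariance(ys))


# Five hypothetical ensemble members; the fourth disagrees strongly in x.
preds = [(10.0, 20.0), (11.0, 19.0), (12.0, 21.0), (30.0, 20.0), (11.0, 20.0)]
center, var = ensemble_summary(preds)
print(center)  # (11.0, 20.0)
print(var)
```

The median stays near the four agreeing members, while the large x-variance flags the disagreement. A per-coordinate variance of this kind is the sort of quantity a smoother can use to down-weight unreliable views.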