Pose Splatter: A 3D Gaussian Splatting Model for Quantifying Animal Pose and Appearance

Jack Goffinet, Youngjo Min, Carlo Tomasi, David E. Carlson

TL;DR

Pose Splatter introduces a scalable, annotation-free framework for reconstructing and quantifying full 3D animal pose and appearance using shape carving and 3D Gaussian splatting. It replaces per-frame optimization and manual labeling with a feed-forward pipeline that refines a voxel prior through a stacked U-Net and renders via Gaussian splats, achieving accurate geometry across mouse, rat, and zebra finch with sparse views. A rotation-invariant visual embedding derived from spherical harmonics provides a compact, informative descriptor for downstream behavioral analyses, and experiments show superior cross-view generalization and subtle movement capture compared with keypoint baselines. The approach enables high-resolution, longitudinal behavioral studies by significantly reducing annotation and computation bottlenecks, with practical implications for mapping genotype and neural activity to micro-behavior.

Abstract

Accurate and scalable quantification of animal pose and appearance is crucial for studying behavior. Current 3D pose estimation techniques, such as keypoint- and mesh-based methods, often face challenges including limited representational detail, labor-intensive annotation requirements, and expensive per-frame optimization. These limitations hinder the study of subtle movements and can make large-scale analyses impractical. We propose Pose Splatter, a novel framework leveraging shape carving and 3D Gaussian splatting to model the complete pose and appearance of laboratory animals without prior knowledge of animal geometry, per-frame optimization, or manual annotations. We also propose a novel rotation-invariant visual embedding technique for encoding pose and appearance, designed to be a plug-in replacement for 3D keypoint data in downstream behavioral analyses. Experiments on datasets of mice, rats, and zebra finches show that Pose Splatter learns accurate 3D animal geometries. Notably, Pose Splatter represents subtle variations in pose, provides better low-dimensional pose embeddings than state-of-the-art methods as evaluated by humans, and generalizes to unseen data. By eliminating annotation and per-frame optimization bottlenecks, Pose Splatter enables the analysis of large-scale, longitudinal behavior needed to map genotype, neural activity, and micro-behavior at unprecedented resolution.
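
The rotation-invariant embedding is described here only as being derived from spherical harmonics, so the sketch below illustrates the general principle rather than the authors' construction. It assumes a scalar signal sampled at directions on a sphere, fits real spherical harmonics up to degree 2 by least squares, and uses the summed squared coefficients per degree as a descriptor, which does not change when the signal is rotated. The names `real_sh_basis`, `fibonacci_sphere`, and `degree_power`, the degree cutoff, and the scalar signal are all assumptions made for illustration.

```python
import numpy as np

def real_sh_basis(xyz):
    """Orthonormal real spherical harmonics of degree 0..2 at unit vectors (N, 3) -> (N, 9)."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    c0 = 0.5 * np.sqrt(1.0 / np.pi)
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    c2 = 0.5 * np.sqrt(15.0 / np.pi)
    c20 = 0.25 * np.sqrt(5.0 / np.pi)
    c22 = 0.25 * np.sqrt(15.0 / np.pi)
    return np.stack([
        c0 * np.ones_like(x),                      # degree 0
        c1 * y, c1 * z, c1 * x,                    # degree 1
        c2 * x * y, c2 * y * z,                    # degree 2
        c20 * (3.0 * z**2 - 1.0),
        c2 * x * z, c22 * (x**2 - y**2),
    ], axis=1)

def fibonacci_sphere(n):
    """Roughly uniform unit directions (hypothetical stand-in for virtual camera positions)."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n
    phi = np.pi * (1.0 + np.sqrt(5.0)) * i
    r = np.sqrt(1.0 - z**2)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def degree_power(values, directions):
    """Least-squares SH fit, then summed squared coefficients per degree (0, 1, 2)."""
    coeffs, *_ = np.linalg.lstsq(real_sh_basis(directions), values, rcond=None)
    bands = [slice(0, 1), slice(1, 4), slice(4, 9)]
    return np.array([np.sum(coeffs[b] ** 2) for b in bands])

# Toy check: a band-limited signal and a rotated copy give the same descriptor.
rng = np.random.default_rng(0)
dirs = fibonacci_sphere(32)
true_coeffs = rng.normal(size=9)

def signal(points):
    return real_sh_basis(points) @ true_coeffs

q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
if np.linalg.det(q) < 0:
    q[:, 0] *= -1.0                            # make it a proper rotation

original = degree_power(signal(dirs), dirs)
rotated = degree_power(signal(dirs @ q), dirs)  # same signal, rotated frame
print(np.allclose(original, rotated))
```

Rotations mix coefficients only within a degree, and they do so orthogonally, so the per-degree sums survive any change of orientation. A real pipeline would expand image-derived features rather than a synthetic scalar signal.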

Paper Structure

This paper contains 40 sections, 5 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: (a) Pose Splatter pipeline. Multi-view images and their corresponding masks are carved into a coarse voxel shape, which a stacked U-Net converts into de-voxelized 3D Gaussian parameters that are finally rendered through Gaussian splatting. The entire process runs in only 2.5 GB of GPU memory (VRAM). (b) Shape-carving concept. Silhouettes from each camera are back-projected into a shared voxel grid (yellow cones), removing voxels outside the visual hull. The green intersection marks the rough volumetric prior fed to the network.
  • Figure 2: (a) Example renderings for visual embedding. 32 virtual cameras, distributed on a sphere centered on the animal, produce appearance-only renderings used to build our visual embedding. (b) Nearest-neighbor preference study. Nearest-neighbor retrieval with the visual embedding (VE) is favored over a 3D-keypoint (KP) baseline in a 102-person study (54 % vs. 50 %; $p=1.5\times10^{-5}$, two-sided $t$-test, $n=40$). The query pose (left) and its two candidates illustrate that the visual embedding preserves a subtle leftward head tilt that the keypoint method misses. (c) Visual embedding tracks subtle movements. Two image rows are shown—the upper row contains ground-truth frames and the lower row the corresponding Pose Splatter renders, illustrating from left to right a typical pose, slight feather expansion, and a head-shaking bout. Beneath them, the first five principal components (PC1–PC5) plotted through time reveal these behaviors: thin grey lines indicate head reorientations that coincide with changes in PCs 2–4, dark-grey bands mark brief head-shaking bouts that stand out in PCs 1 and 5, and the surrounding light-blue interval captures slow feather expansion and compression, reflected in the low-frequency trends of PCs 1 and 2.
  • Figure 3: (a) Cross-species renderings. (b) Renderings given different numbers of input views. The rendered views are novel for the 5- and 4-camera models.
  • Figure 4: (a) Single-view reconstruction. Against single-view baselines, MagicPony and 3D Fauna collapse when the camera departs from the input view, failing to recover a plausible mouse geometry. Pose Splatter, by contrast, reconstructs accurate shapes from all viewpoints of an unseen time step in the test set. (b) Sparse-view 3DGS comparison. Most sparse-view 3DGS baselines reproduce the white background well but fail to reconstruct the given subject. Consequently, their quantitative scores appear high even though the rendered animals lack detail. See Table \ref{tab:quant_results}a for quantitative scores. (c) Comparison with per-scene-optimized 3DGS (4 views). Some methods post good metrics yet still fail to reconstruct the given subject. See Table \ref{tab:quant_results}b for metrics. Only GaussianObject and Pose Splatter deliver comparable, visually convincing foreground reconstructions.
  • Figure 5: Left: $R^2$ values of predicting egocentric 3D keypoints from visual embeddings. Each scatter point represents a single manually annotated keypoint. Right: Accuracies of logistic regression models predicting different manually annotated behaviors using egocentric 3D keypoints (gray) versus visual embeddings (purple). Six of eight behaviors are better predicted by the visual embedding.
  • ...and 8 more figures
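
The shape-carving concept in Figure 1(b) can be sketched in a few lines of NumPy. This is a generic visual-hull illustration, not the authors' implementation: the function name `carve_visual_hull`, the 3x4 projection-matrix camera model, and the two toy orthographic cameras are assumptions chosen to keep the example small.

```python
import numpy as np

def carve_visual_hull(masks, projections, grid_points):
    """Keep the voxels whose projection lands inside every silhouette.

    masks:        list of (H, W) boolean silhouette images
    projections:  list of (3, 4) camera projection matrices (assumed known)
    grid_points:  (N, 3) voxel centers in world coordinates
    Returns an (N,) boolean occupancy vector.
    """
    n = len(grid_points)
    homog = np.hstack([grid_points, np.ones((n, 1))])
    occupied = np.ones(n, dtype=bool)
    for mask, P in zip(masks, projections):
        uvw = homog @ P.T                        # homogeneous pixel coordinates
        in_front = uvw[:, 2] > 1e-9
        w = np.where(in_front, uvw[:, 2], 1.0)   # avoid dividing by zero
        u = np.floor(uvw[:, 0] / w).astype(int)  # pixel column
        v = np.floor(uvw[:, 1] / w).astype(int)  # pixel row
        height, width = mask.shape
        in_image = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        inside = np.zeros(n, dtype=bool)
        inside[in_image] = mask[v[in_image], u[in_image]]
        occupied &= inside                       # carve away anything outside
    return occupied

# Toy scene: a 9x9x9 grid seen by two orthographic cameras.
axis = np.arange(-4, 5)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
grid = grid.reshape(-1, 3).astype(float)

P_along_z = np.array([[1, 0, 0, 4], [0, 1, 0, 4], [0, 0, 0, 1]], dtype=float)  # u = x + 4, v = y + 4
P_along_x = np.array([[0, 1, 0, 4], [0, 0, 1, 4], [0, 0, 0, 1]], dtype=float)  # u = y + 4, v = z + 4

square = np.zeros((9, 9), dtype=bool)
square[2:7, 2:7] = True                          # 5x5 silhouette in each view

occupied = carve_visual_hull([square, square], [P_along_z, P_along_x], grid)
print(int(occupied.sum()))                       # voxels left in the visual hull
```

Each camera can only remove voxels, so the result is the intersection of the back-projected silhouettes. In the pipeline described above, a coarse volume of this kind is the prior that the stacked U-Net refines.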