Object Agnostic 3D Lifting in Space and Time

Christopher Fusco, Shin-Fang Ch'ng, Mosam Dabhi, Simon Lucey

TL;DR

This work introduces Object Agnostic 3D Lifting in Space and Time, a temporally aware lifting framework that generalizes across object categories without category-specific training. By embedding a temporal inductive bias into a transformer-based architecture, the model focuses on nearby frames to capture motion while modeling spatial joint relations via a dual-graph space encoder. A Procrustes-aligned decoder produces canonical 3D skeletons, with a loss that jointly optimizes geometry and velocity to ensure smooth sequence reconstruction. The authors also provide AnimalSyn3D, a synthetic 4D animal dataset with 13 categories to benchmark class-agnostic lifting, and demonstrate strong gains in SA-MPJPE, SA-MPVE, and FA-MPJPE under occlusion, unseen categories, and rig transfer scenarios. These results highlight improved generalization, data efficiency, and applicability to diverse real-world contexts where labeled 3D animal data are scarce.
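To make the "Procrustes-aligned decoder" and the joint geometry-and-velocity loss more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a rigid (rotation plus translation, no scale) per-frame alignment, an L2 per-joint error, and a hypothetical weight `lam` on the velocity term; the function names `procrustes_align` and `lifting_loss` are illustrative only.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Rigidly align pred (J, 3) to gt (J, 3) using the Kabsch algorithm.

    Assumption: rotation + translation only; the paper may also handle scale.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    # Cross-covariance between centered prediction and centered ground truth.
    H = P.T @ G
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return P @ R.T + mu_g


def lifting_loss(pred_seq, gt_seq, lam=1.0):
    """Position error plus velocity error on per-frame aligned skeletons.

    pred_seq, gt_seq: arrays of shape (T, J, 3). `lam` is a hypothetical weight.
    """
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred_seq, gt_seq)])
    # Geometry term: mean per-joint Euclidean distance.
    pos_err = np.linalg.norm(aligned - gt_seq, axis=-1).mean()
    # Velocity term: mean error of frame-to-frame joint displacements.
    vel_err = np.linalg.norm(
        np.diff(aligned, axis=0) - np.diff(gt_seq, axis=0), axis=-1
    ).mean()
    return pos_err + lam * vel_err
```

Under these assumptions, a prediction that differs from the ground truth only by a rigid transform incurs (numerically) zero loss, which is the point of aligning before measuring error: the decoder is free to output skeletons in a canonical frame.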

Abstract

We present a spatio-temporal perspective on category-agnostic 3D lifting of 2D keypoints over a temporal sequence. Our approach differs from existing state-of-the-art methods, which are either (i) object-agnostic but limited to individual frames, or (ii) able to model space-time dependencies but designed for a single object category. Our approach is grounded in two core principles. First, general information about similar objects can be leveraged to achieve better performance when there is little object-specific training data. Second, a temporally proximate context window is advantageous for achieving consistency throughout a sequence. These two principles allow us to outperform current state-of-the-art methods on per-frame and per-sequence metrics for a variety of animal categories. Lastly, we release a new synthetic dataset containing 3D skeletons and motion sequences for a variety of animal categories.
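One simple way to picture a temporally proximate context window is a banded attention mask, where each frame attends only to frames within a fixed temporal distance. The sketch below is an illustration under that assumption; the summary above does not specify how the paper embeds its temporal inductive bias, and the names `local_window_mask`, `masked_attention`, and the `window` parameter are hypothetical.

```python
import numpy as np


def local_window_mask(num_frames, window):
    """Boolean (T, T) mask: frame i may attend to frame j iff |i - j| <= window."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask.

    q, k, v: arrays of shape (T, D). Returns (output, attention weights).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed frame pairs get -inf so their softmax weight is exactly zero.
    scores = np.where(mask, scores, -np.inf)
    # Each row keeps its diagonal entry, so the row maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

For example, with `window=2` each frame's output is a weighted combination of at most five frames (itself and two neighbors on either side), which biases the model toward local motion cues rather than distant frames.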

Paper Structure

This paper contains 46 sections, 17 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Left: Bottom row shows the 3D skeletons of a puma in motion. The blue lines represent our model's predictions, closely tracking the red ground-truth lines, demonstrating our model's ability to generate smooth and precise motion over time. The dashed line highlights the trajectory of a specific joint $\hat{\mathbf{Y}}_{t,j}$, emphasizing the temporal consistency and accuracy of our approach. Right: Quantitative FA-MPJPE comparison across 13 animal categories, where our method consistently outperforms competing models.
  • Figure 2: Overview of our data pipeline and 3D lifting model. The left side of the figure demonstrates (a) the process of calculating skeleton joints from animal mesh vertices, and (b) the projection of those joints into 2D keypoints. The right side of the figure illustrates our lifting model at a high level. The sequence of 2D inputs and temporal indices is projected and passed through our motion encoder and space encoder layers. The spatio-temporal latent features are decoded into canonical 3D structures. The canonical structures are then aligned to the ground truth (GT) via Procrustes alignment for calculating the loss.
  • Figure 3: Quantitative comparison on a Deer sequence from two different views: Our method provides significantly more accurate 3D predictions. In this visualization, blue represents the predicted 3D points whereas the orange denotes the ground truth.
  • Figure 4: OOD generalization. OOD to unseen data (left): We perform a 13-fold evaluation to assess each method's ability to handle unseen animal categories. OOD to an unseen category and rig (right): Note that MotionBERT is constrained to rigs with the same number of joints as, or fewer joints than, those seen during training, and hence cannot handle unseen rigs with more joints. Our method generalizes more effectively to both unseen categories and unseen rigs.
  • Figure 5: OOD generalization on an unseen Bunny category from two different views: Our method provides significantly more accurate 3D predictions compared to 3D-LFM. In this visualization, blue represents the predicted 3D points whereas the orange denotes the ground truth.
  • ...and 3 more figures