Object Agnostic 3D Lifting in Space and Time
Christopher Fusco, Shin-Fang Ch'ng, Mosam Dabhi, Simon Lucey
TL;DR
This work introduces Object Agnostic 3D Lifting in Space and Time, a temporally aware lifting framework that generalizes across object categories without category-specific training. By embedding a temporal inductive bias into a transformer-based architecture, the model focuses on nearby frames to capture motion while modeling spatial joint relations via a dual-graph space encoder. A Procrustes-aligned decoder produces canonical 3D skeletons, with a loss that jointly optimizes geometry and velocity to ensure smooth sequence reconstruction. The authors also provide AnimalSyn3D, a synthetic 4D animal dataset with 13 categories to benchmark class-agnostic lifting, and demonstrate strong gains in SA-MPJPE, SA-MPVE, and FA-MPJPE under occlusion, unseen categories, and rig transfer scenarios. These results highlight improved generalization, data efficiency, and applicability to diverse real-world contexts where labeled 3D animal data are scarce.
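To make the idea of a loss that jointly optimizes geometry and velocity concrete, here is a minimal sketch in plain Python. It is an illustration only, not the authors' implementation. The function names, the finite-difference velocity, and the `velocity_weight` parameter are assumptions introduced here. The Procrustes alignment step is omitted, and the summary does not state the exact form or weighting of the paper's loss.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint Euclidean error over a sequence.

    pred, target: lists of frames; each frame is a list of (x, y, z) tuples.
    """
    total, count = 0.0, 0
    for frame_p, frame_t in zip(pred, target):
        for p, t in zip(frame_p, frame_t):
            total += math.dist(p, t)
            count += 1
    if count == 0:
        raise ValueError("need at least one frame with one joint")
    return total / count

def velocities(seq):
    """Finite-difference joint velocities between consecutive frames."""
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]

def mpve(pred, target):
    """Mean per-joint velocity error (requires at least two frames)."""
    return mpjpe(velocities(pred), velocities(target))

def lifting_loss(pred, target, velocity_weight=1.0):
    """Hypothetical combined loss: position term plus weighted velocity term."""
    return mpjpe(pred, target) + velocity_weight * mpve(pred, target)

# Toy sequence: 2 frames, 2 joints, moving one unit along y.
target = [[(0, 0, 0), (1, 0, 0)], [(0, 1, 0), (1, 1, 0)]]
# Constant offset along z: wrong position, but correct motion.
shifted = [[(0, 0, 2), (1, 0, 2)], [(0, 1, 2), (1, 1, 2)]]
# Frame-to-frame jitter: first frame exact, second frame offset along z.
jittery = [[(0, 0, 0), (1, 0, 0)], [(0, 1, 1), (1, 1, 1)]]

print(lifting_loss(shifted, target), lifting_loss(jittery, target, 0.5))
```

In this toy case the constantly shifted prediction incurs only a position penalty, while the jittery prediction is also penalized by the velocity term. This is the behaviour that a velocity-aware loss relies on to encourage smooth sequences.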
Abstract
We present a spatio-temporal perspective on category-agnostic 3D lifting of 2D keypoints over a temporal sequence. Existing state-of-the-art methods are either (i) object-agnostic but able to operate only on individual frames, or (ii) able to model space-time dependencies but designed for a single object category. Our approach is grounded in two core principles. First, general information about similar objects can be leveraged to achieve better performance when there is little object-specific training data. Second, a temporally proximate context window is advantageous for achieving consistency throughout a sequence. These two principles allow us to outperform current state-of-the-art methods on per-frame and per-sequence metrics for a variety of animal categories. Lastly, we release a new synthetic dataset containing 3D skeletons and motion sequences for a variety of animal categories.
