3D-LFM: Lifting Foundation Model

Mosam Dabhi, Laszlo A. Jeni, Simon Lucey

TL;DR

3D-LFM proposes a unified, object-agnostic 2D landmarks-to-3D lifting framework that operates across 30+ categories in a single model. It leverages a permutation-equivariant graph transformer, Token Positional Encoding, and Procrustes-based alignment to handle varying keypoint counts, occlusions, and unseen categories, achieving state-of-the-art performance and strong out-of-distribution generalization. The approach unifies learning across diverse object categories, demonstrates robust OOD and rig-transfer capabilities, and validates the model as a potential foundation model for diverse 2D-3D lifting tasks. This work holds practical significance for broad 3D reconstruction applications in AR, robotics, and beyond, by enabling scalable, cross-category 2D-3D lifting without object-specific semantics.

Abstract

The lifting of 3D structure and camera from 2D landmarks is a cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data -- significantly limiting their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage a varying number of points per 3D data instance, withstand occlusions, and generalize to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.
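
The permutation equivariance mentioned in the abstract is a general property of self-attention when no positional encoding is applied: reordering the input tokens reorders the outputs in the same way, and the same weights accept any number of tokens. The following is a minimal sketch of that property only, not the 3D-LFM architecture. The function names, the identity query/key/value projections, and the single attention head are assumptions made for illustration.

```python
import math
import random


def self_attention(tokens):
    """Single-head dot-product self-attention over a list of feature vectors.

    Simplifications (assumptions of this sketch): identity query/key/value
    projections and no positional encoding.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Output is the attention-weighted average of all tokens.
        outputs.append([sum(w * tok[j] for w, tok in zip(weights, tokens)) / total
                        for j in range(dim)])
    return outputs


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two token lists."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


random.seed(0)
tokens = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
perm = [3, 0, 5, 1, 4, 2]  # arbitrary reordering of the 6 landmarks

out = self_attention(tokens)
out_of_permuted = self_attention([tokens[i] for i in perm])
permuted_out = [out[i] for i in perm]

# Equivariance: permuting the inputs permutes the outputs the same way.
equivariance_gap = max_abs_diff(out_of_permuted, permuted_out)

# The same function accepts a different number of landmarks unchanged.
out_fewer = self_attention(tokens[:3])
```

Here `equivariance_gap` is zero up to floating-point rounding, which is why a model built on such layers does not need landmarks to arrive in a fixed, category-specific order.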

Paper Structure (21 sections, 13 equations, 8 figures, 3 tables)


Figures (8)

  • Figure 0: Overview: (a) This figure shows the 3D-LFM's ability to lift 2D landmarks into 3D structures across an array of over 30 diverse categories, from human body parts to a plethora of animals and everyday common objects. The lower portion shows the actual 3D reconstructions by our model, with red lines representing the ground truth and blue lines showing the 3D-LFM's predictions. (b) This figure displays the model's training data distribution on a logarithmic scale, highlighting that, in spite of 3D-LFM being trained on imbalanced datasets, it maintains performance across individual categories.
  • Figure 1: Overview of the 3D Lifting Foundation Model (3D-LFM) architecture: The process begins with the input 2D keypoints undergoing Token Positional Encoding (TPE) before being processed by a series of graph-based transformer layers. The resulting features are then decoded through an MLP into a canonical 3D shape. This shape is aligned to the ground truth (G.T. 3D) in the reference frame using a Procrustean method, with the Mean Squared Error (MSE) loss computed to guide the learning. The architecture captures both local and global contextual information, focusing on deformable structures while minimizing computational complexity.
  • Figure 2: 3D-LFM vs. C3DPO Performance: MPJPE comparisons on the PASCAL3D+ dataset demonstrate our model's adaptability in the absence of object-specific information, in contrast to C3DPO's increased error under the same conditions. The analysis confirms 3D-LFM's superiority across diverse object categories, reinforcing its potential for generalized 2D to 3D lifting.
  • Figure 3: Performance Comparison on H3WB Benchmark: This chart contrasts MPJPE errors for whole-body, body, face, aligned face, hand, and aligned hand categories within the H3WB benchmark. Our models, with and without Procrustes Alignment (Ours-PA), outperform current state-of-the-art (SOTA) methods, validating our approach's proficiency in 2D to 3D lifting tasks.
  • Figure 4: Generalization to unseen data: The top row shows 3D-LFM's proficiency in OOD 2D-3D lifting, handling new, unseen categories and rig generalization on Acinoset, PASCAL3D+, and Panoptic Studio with varying joint arrangements. The bottom row presents in-the-wild data from the MBW dataset, with red dots indicating input keypoints and blue stick figures showing the model’s 3D predictions from different angles.
  • ...and 3 more figures
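
The Figure 1 caption describes aligning the predicted canonical shape to the ground truth with a Procrustean method before computing an MSE loss, and Figures 2 and 3 report MPJPE. The sketch below illustrates those three quantities with NumPy. It assumes a rigid alignment (rotation plus translation, no scaling) solved with the standard SVD-based Kabsch procedure; the excerpt does not state which Procrustes variant the paper uses, and the names `procrustes_align`, `mpjpe`, and `mse` are hypothetical.

```python
import numpy as np


def procrustes_align(pred, target):
    """Rigidly align pred (N, 3) to target (N, 3).

    Assumption of this sketch: rotation + translation only, no scaling.
    """
    mu_pred = pred.mean(axis=0)
    mu_target = target.mean(axis=0)
    p = pred - mu_pred        # centered prediction
    t = target - mu_target    # centered ground truth
    # SVD of the 3x3 cross-covariance gives the optimal orthogonal map.
    u, _, vt = np.linalg.svd(p.T @ t)
    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt   # chosen so that p @ rot ~ t
    return p @ rot + mu_target


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance per point."""
    return float(np.linalg.norm(pred - target, axis=1).mean())


def mse(pred, target):
    """Mean squared error over all coordinates."""
    return float(np.mean((pred - target) ** 2))


# Synthetic check: the "prediction" is the ground truth under a known
# rotation about the z-axis plus a translation.
rng = np.random.default_rng(0)
target = rng.normal(size=(17, 3))
theta = np.pi / 3
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
pred = target @ rot_z + np.array([0.5, -1.0, 2.0])

aligned = procrustes_align(pred, target)
error_before = mpjpe(pred, target)
error_after = mpjpe(aligned, target)
loss_after = mse(aligned, target)
```

In this synthetic case the alignment removes the rigid offset entirely, so `error_after` and `loss_after` are zero up to rounding while `error_before` is large. On real predictions the residual after alignment measures shape error independent of pose.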