GEOPARD: Geometric Pretraining for Articulation Prediction in 3D Shapes

Pradyumn Goyal, Dmitry Petrov, Sheldon Andrews, Yizhak Ben-Shabat, Hsueh-Ti Derek Liu, Evangelos Kalogerakis

TL;DR

GEOPARD addresses the problem of predicting articulation parameters for 3D shapes from a single snapshot by learning articulation-aware features through a transformer architecture. It introduces a geometric pretraining strategy that automatically generates physically valid candidate articulations via a geometry-driven search, enabling label-efficient learning before fine-tuning on annotated data. The method achieves state-of-the-art articulation inference on PartNet-Mobility, benefiting from learnable queries, context-aware part representations, and specialized decoders for pivot, axis, and motion type, with rigorous pruning to ensure physical plausibility. This approach reduces reliance on manual annotations and enhances generalization across diverse object categories and kinematic hierarchies, making it practical for digital twins and interactive 3D understanding.

Abstract

We present GEOPARD, a transformer-based architecture for predicting articulation from a single static snapshot of a 3D shape. The key idea of our method is a pretraining strategy that allows our transformer to learn plausible candidate articulations for 3D shapes based on a geometry-driven search, without manual articulation annotation. The search automatically discovers physically valid part motions that do not cause detachments or collisions with other shape parts. Our experiments indicate that this geometric pretraining strategy, along with carefully designed choices in our transformer architecture, yields state-of-the-art results in articulation inference on the PartNet-Mobility dataset.
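The pruning idea behind the geometric search (keep only part motions that cause neither collisions nor detachments) can be illustrated with a minimal sketch. This is not the authors' procedure: the axis-aligned boxes standing in for the rest of the shape, the `attach_tol` threshold, the number of sampled motion steps, and all function and candidate names are assumptions made for illustration only.

```python
import math

def rotate(p, axis, pivot, angle):
    """Rotate point p about the line through `pivot` along unit `axis` (Rodrigues' formula)."""
    v = [p[i] - pivot[i] for i in range(3)]
    c, s = math.cos(angle), math.sin(angle)
    kxv = (axis[1] * v[2] - axis[2] * v[1],
           axis[2] * v[0] - axis[0] * v[2],
           axis[0] * v[1] - axis[1] * v[0])
    kdv = sum(axis[i] * v[i] for i in range(3))
    return tuple(pivot[i] + v[i] * c + kxv[i] * s + axis[i] * kdv * (1 - c)
                 for i in range(3))

def translate(p, axis, amount):
    """Slide point p by `amount` along unit `axis`."""
    return tuple(p[i] + amount * axis[i] for i in range(3))

def dist_to_box(p, box):
    """Distance from p to an axis-aligned box (lo, hi); 0.0 if p is inside or on it."""
    lo, hi = box
    return math.sqrt(sum(max(lo[i] - p[i], 0.0, p[i] - hi[i]) ** 2 for i in range(3)))

def is_valid(points, boxes, cand, attach_tol=0.1, steps=6):
    """Hypothetical validity test: sample the motion and reject on collision or detachment."""
    for k in range(1, steps + 1):
        amount = cand["max_amount"] * k / steps
        if cand["type"] == "revolute":
            moved = [rotate(p, cand["axis"], cand["pivot"], amount) for p in points]
        else:  # prismatic
            moved = [translate(p, cand["axis"], amount) for p in points]
        nearest = min(min(dist_to_box(p, b) for b in boxes) for p in moved)
        if nearest <= 0.0:        # a moved point entered another part: collision
            return False
        if nearest > attach_tol:  # the part no longer touches the shape: detachment
            return False
    return True

# Toy shape: a door (sampled points in the plane y = 0) in front of a cabinet body (one box).
door = [(x / 4, 0.0, z / 2) for x in range(5) for z in range(3)]
body = [((0.0, -1.0, 0.0), (1.0, -0.05, 1.0))]
origin = (0.0, 0.0, 0.0)

candidates = [
    {"name": "hinge_z_open", "type": "revolute", "axis": (0, 0, 1), "pivot": origin,
     "max_amount": math.radians(60)},
    {"name": "hinge_z_inward", "type": "revolute", "axis": (0, 0, 1), "pivot": origin,
     "max_amount": math.radians(-60)},
    {"name": "slide_out", "type": "prismatic", "axis": (0, 1, 0), "pivot": origin,
     "max_amount": 0.6},
    {"name": "hinge_x_open", "type": "revolute", "axis": (1, 0, 0), "pivot": origin,
     "max_amount": math.radians(-60)},
]

valid = [c["name"] for c in candidates if is_valid(door, body, c)]
print(valid)
```

In this toy setup, swinging the door into the cabinet is rejected as a collision and sliding it straight out is rejected as a detachment, while two hinge placements survive. More than one candidate can pass a purely geometric test, which is consistent with the abstract treating the search output as plausible candidates for pretraining rather than as ground truth.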

Paper Structure

This paper contains 38 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: GEOPARD predicts articulation parameters for diverse object categories and complex kinematic hierarchies. The key idea of our method is to use geometrically valid articulations as a form of self-supervision. We pretrain our model with this signal, then fine-tune it on articulated shape datasets with ground-truth annotations, resulting in precise articulation inference.
  • Figure 2: GEOPARD overview. First, we learn part feature representations (a) from the part points along with shape context representation (b). Second, we enhance the part-level feature representations with the shape context (c). Third, the representations are aggregated to a compact, articulation-aware part feature vector (d), which is used to predict the part articulation through a set of three dedicated decoding branches: part pivot prediction (e); part motion axis prediction (f); motion type prediction (g).
  • Figure 3: For a segmented input (left), we compute a set of possible articulations, reject those that cause detachments from, or collisions with, the rest of the shape (right), and keep the valid candidate articulations (middle) for our pretraining.
  • Figure 4: Qualitative comparisons (with labels). Color coding distinguishes parts predicted or labeled as revolute, parts predicted or labeled as prismatic, and input parts. The results show that our model predicts motion type and axis direction (Row 1) and revolute points (Rows 2-3) with improved accuracy.
  • Figure 5: Qualitative comparison (without labels). Color coding distinguishes parts predicted or labeled as revolute, parts predicted or labeled as prismatic, and input parts. Predicted axes are shown with an arrow ($\uparrow$). While baselines based on part abstractions struggle to predict plausible articulation parameters, our base model, which uses fine-grained point features, produces articulation parameters that closely match the ground truth. Our pretraining strategy improves them further by supplying geometric and articulation priors that are refined during fine-tuning.
  • ...and 1 more figure