Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics

Tze Ho Elden Tse, Runyang Feng, Linfang Zheng, Jiho Park, Yixing Gao, Jihie Kim, Ales Leonardis, Hyung Jin Chang

TL;DR

The paper tackles joint 3D hand-object reconstruction and interaction recognition from egocentric RGB videos, introducing a collaborative two-branch framework that fuses 3D geometric cues with appearance features. Superquadrics are employed as a compact, template-free 3D object representation, enabling dense object geometry recovery and improved action recognition, especially under compositional, unseen-object splits. The methodology integrates a Transformer-based appearance branch, a geometric branch that first reconstructs object shapes via superquadrics and then predicts hand poses, and a compositional reasoning module that predicts verb and noun labels before an interaction decoder for final action classification. Extensive experiments on H2O and FPHA show state-of-the-art performance in both standard and compositional settings, highlighting the value of explicit 3D geometric reasoning for generalization to unseen objects and actions. The work advances template-free 3D hand-object understanding and demonstrates practical impact for AR/VR and embodied AI, while also outlining limitations related to shape complexity and template quality that future work could address with deformable shape models and articulated object representations.

Abstract

With the availability of egocentric 3D hand-object interaction datasets, there is increasing interest in developing unified models for hand-object pose estimation and action recognition. However, existing methods still struggle to recognise seen actions on unseen objects due to the limitations in representing object shape and movement using 3D bounding boxes. Additionally, the reliance on object templates at test time limits their generalisability to unseen objects. To address these challenges, we propose to leverage superquadrics as an alternative 3D object representation to bounding boxes and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. Moreover, as we find that pure appearance-based methods can outperform the unified methods, the potential benefits from 3D geometric information remain unclear. Therefore, we study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. We extend the H2O and FPHA datasets with compositional splits and design a novel collaborative learning framework that can explicitly reason about the geometric relations between hands and the manipulated object. Through extensive quantitative and qualitative evaluations, we demonstrate significant improvements over the state of the art in (compositional) action recognition.
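
The compositional setting described in the abstract can be illustrated with a small check on label pairs. The following is only a sketch: the function name and the example labels are hypothetical, and the second condition (every test verb and noun appears individually in training) is an assumption commonly made in compositional action recognition, not something the abstract states.

```python
def is_compositional_split(train_pairs, test_pairs):
    """Check a (verb, noun) split for the compositional setting.

    Condition 1 (from the abstract): no verb-noun combination
    appears in both the training and the testing split.
    Condition 2 (assumption): each test verb and each test noun
    is seen on its own somewhere in the training split.
    """
    train = set(train_pairs)
    test = set(test_pairs)

    # Condition 1: the two sets of combinations are disjoint.
    no_shared_combinations = train.isdisjoint(test)

    # Condition 2: primitives are seen during training.
    train_verbs = {verb for verb, _ in train}
    train_nouns = {noun for _, noun in train}
    primitives_seen = all(
        verb in train_verbs and noun in train_nouns for verb, noun in test
    )
    return no_shared_combinations and primitives_seen


# Hypothetical labels, for illustration only.
train_pairs = [("open", "bottle"), ("pour", "milk"), ("close", "milk")]
good_test = [("open", "milk"), ("pour", "bottle")]
bad_test = [("open", "bottle")]  # combination already seen in training

print(is_compositional_split(train_pairs, good_test))
print(is_compositional_split(train_pairs, bad_test))
```

Under these assumptions, a model cannot succeed on the test split by memorising verb-noun combinations; it has to recognise the verb and the noun separately and recombine them.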

Paper Structure (46 sections, 2 equations, 9 figures, 8 tables)

This paper contains 46 sections, 2 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Our method jointly reconstructs hand-object meshes without object instance-specific templates and recognises interaction from egocentric RGB video. We further consider a more challenging problem scenario, compositional action recognition, where combinations of verb (in green) and noun (in red) are unseen during training. Our model is designed for generalising action recognition by explicitly leveraging 3D geometric information.
  • Figure 2: Overview of our approach. It first takes RGB videos as input and produces per-frame spatial features $\mathbf{x}$ using a CNN backbone. Then, the appearance branch (bottom) applies positional encoding to $\mathbf{x}$ and combines it with a learnable token $\mathbf{T}_{\text{img}}$ before feeding it into a Transformer encoder. Similarly, the geometric branch (top) extracts geometric features $\mathbf{T}_{\text{geo}}$ from the flattened spatial features $\mathbf{F}_{\text{img}}$ using a Transformer decoder. The features from both branches are combined to predict superquadrics and object category. In addition, the geometric features are aggregated through another Transformer encoder to create global context-aware features between object shape and hand poses. The aggregated geometric features $\mathbf{x}_{\text{geometric}}$ and verb token features $\mathbf{T}_{\text{verb}}$ from this encoder are used to predict hand pose and action verb. Finally, the action class is predicted by feeding $\mathbf{x}_{\text{geometric}}$ into a cross-attention mechanism with the aggregated spatial representation $\mathbf{x}_{\text{appearance}}$ through a Transformer decoder.
  • Figure 3: Qualitative examples of convex superquadrics. We show that superquadrics can model diverse objects by varying the shape parameters, $\epsilon_1$ (y-axis) and $\epsilon_2$ (x-axis).
  • Figure 4: Qualitative examples of superquadrics. We extract superquadrics from everyday objects obtained from the YCB [calli2015ycb], ShapeNet [shapenet2015], FPHA [garcia2018first] and H2O [kwon2021h2o] datasets. We show that superquadrics have sufficient expressiveness to represent everyday objects. We also present an example failure case in the red box.
  • Figure 5: Qualitative examples on H2O. We show that our model can recover plausible interaction across different object categories and hand-object configurations without object templates. We present additional qualitative examples and failure cases in the supplementary.
  • ...and 4 more figures
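
To make the role of the shape parameters $\epsilon_1$ and $\epsilon_2$ in the Figure 3 caption concrete, the sketch below evaluates the standard superquadric inside-outside function. This is the textbook formulation, assumed here for illustration; the captions above do not give the paper's exact parameterisation, and the function name and the `scale` argument (per-axis sizes) are hypothetical. The function returns a value below 1 for points inside the surface, 1 on the surface, and above 1 outside, for a superquadric centred at the origin and aligned with the axes.

```python
def superquadric_inside_outside(point, scale, eps1, eps2):
    """Standard superquadric inside-outside function (assumed form).

    point: (x, y, z) in the superquadric's own coordinate frame.
    scale: (a1, a2, a3) positive sizes along each axis (hypothetical name).
    eps1, eps2: positive shape parameters; values near 0 give box-like
    shapes and eps1 = eps2 = 1 gives an ellipsoid.
    """
    # Absolute values keep fractional powers real for negative coordinates.
    x, y, z = (abs(p) / a for p, a in zip(point, scale))
    xy_term = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy_term + z ** (2.0 / eps1)


unit = (1.0, 1.0, 1.0)
near_corner = (0.9, 0.9, 0.9)

# With eps1 = eps2 = 1 the shape is a unit sphere: the point is outside.
sphere_value = superquadric_inside_outside(near_corner, unit, 1.0, 1.0)

# With small eps1, eps2 the shape approaches a cube: the point is inside.
box_value = superquadric_inside_outside(near_corner, unit, 0.1, 0.1)

print(sphere_value > 1.0, box_value < 1.0)
```

The same point near the corner of the unit cube falls outside the sphere-like shape and inside the box-like one, which shows how two scalar parameters alone change the object family being represented.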