Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing

Yuejiao Su; Yi Wang; Lei Yao; Yawen Cui; Lap-Pui Chau

TL;DR

This work proposes an end-to-end Interaction-aware Transformer (InterFormer) that integrates three key components: a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. It achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets.

Abstract

A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to "interaction illusion", producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at https://github.com/yuggiehk/InterFormer.
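The abstract does not give the form of the CoCo loss, so the following is only a minimal sketch of the general idea of penalizing "interaction illusion", not the authors' formulation. It assumes a hypothetical class layout (background, left hand, left-hand object), a soft image-level presence score taken as the maximum per-pixel probability, and a penalty that grows when an object class is predicted while the hand it depends on is absent. The function names and the penalty form are illustrative assumptions.

```python
import numpy as np


def presence(prob_map):
    """Soft image-level presence score: the maximum per-pixel probability.

    This choice of score is an assumption made for illustration.
    """
    return float(prob_map.max())


def cooccurrence_penalty(probs, pairs):
    """Hypothetical co-occurrence penalty (not the paper's CoCo loss).

    probs: array of shape (C, H, W) holding per-class probabilities.
    pairs: list of (object_class, required_hand_class) index pairs.
    Returns the mean over pairs of
        presence(object) * (1 - presence(hand)),
    which is large when an object is predicted without its hand.
    """
    terms = [presence(probs[o]) * (1.0 - presence(probs[h])) for o, h in pairs]
    return float(np.mean(terms))


# Hypothetical class indices: 0 = background, 1 = left hand, 2 = left-hand object.
PAIRS = [(2, 1)]

# Physically consistent prediction: hand and its object are both present.
consistent = np.zeros((3, 2, 2))
consistent[1, 0, 0] = 0.9
consistent[2, 0, 1] = 0.8

# "Interaction illusion": an object is predicted but the hand is nearly absent.
illusion = np.zeros((3, 2, 2))
illusion[1] = 0.05
illusion[2, 0, 1] = 0.8

penalty_consistent = cooccurrence_penalty(consistent, PAIRS)
penalty_illusion = cooccurrence_penalty(illusion, PAIRS)
```

Under these assumptions, the illusion case receives a much larger penalty than the consistent case, which is the behavior a relationship constraint of this kind is meant to encourage. A trainable version would need a differentiable presence score and the class definitions used by the released code.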

Paper Structure (31 sections, 8 equations, 12 figures, 11 tables)

Introduction
Related Work
Methodology
Overview
Dynamic Query Generator
Dual-context Feature Selector
Conditional Co-Occurrence Loss
Experiments
Datasets and Metrics
Comparisons with State-of-the-Art Methods
In-domain Comparison Results
Out-of-distribution Comparison Results
Ablation Study
Visualization Results
Conclusion
...and 16 more sections

Figures (12)

Figure 1: Model size vs. mIoU for InterFormer compared to other methods. Evaluations use EgoHOS in-domain (In-domain), EgoHOS out-of-domain (OOD Test 1), and mini-HOI4D (OOD Test 2) datasets.
Figure 2: Illustration of interaction illusion, in which segmentation results violate real-world causal dependencies between hands and objects.
Figure 3: Architecture of our end-to-end InterFormer. Given an input egocentric image, a backbone network first extracts global and multi-scale pixel-level features. We add an additional IPP branch to extract coarse boundary-guided representations that characterize the interaction. Subsequently, the DQG produces robust and dynamic queries by integrating interaction-relevant contextual information with learnable parameters. Finally, these queries and extracted features are fed into the InterFormer decoder, which employs the DFS to refine interaction-aware representations and generate the final segmentation masks. The overall end-to-end architecture is supervised by the classification loss $\mathcal{L}_{cls}$, dice loss $\mathcal{L}_{dic}$, cross entropy loss $\mathcal{L}_{ce}$, IPP loss $\mathcal{L}_{b}$, and CoCo loss $\mathcal{L}_{co}$.
Figure 4: Detailed architecture of Dual-context Feature Selector (DFS).
Figure 5: Ablation study results on the EgoHOS in-domain test set.
...and 7 more figures
