POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

Boshen Xu, Sipeng Zheng, Qin Jin

TL;DR

The paper tackles cross-view egocentric hand-object interaction (Ego-HOI) recognition by using abundant multi-view third-person videos to learn view-agnostic representations. It introduces POV, a prompt-oriented framework that combines frame-level interactive masking prompts (IMP) and token-level view-aware prompts within a vision transformer. The model is trained with two core tasks, prompt-based action understanding and view-agnostic prompt tuning, followed by an optional egocentric fine-tuning stage. The authors define Ego-HOI-XView benchmarks on Assembly101 and H2O to evaluate cross-view transfer, and report that prompt tuning with a frozen backbone generalizes well to egocentric views, outperforming state-of-the-art zero-shot and few-shot baselines with notable efficiency benefits. Key contributions include the design of frame- and token-level prompts, a two-stage training strategy with a cross-view alignment (CVA) objective, and ablations showing the importance of CVA and IMP for fine-grained HOI recognition. The work offers a practical route to Ego-HOI understanding with limited target-view data, which is relevant to robotics and AR applications.

Abstract

We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representations from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework, which enables this view adaptation with few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representations. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization. Our code is available at https://github.com/xuboshen/pov_acmmm2023.
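The TL;DR mentions a cross-view alignment objective, but this page does not give its formula. As a rough illustration of what aligning views can mean, the sketch below uses an InfoNCE-style contrastive loss, which is a stand-in technique and may differ from the paper's actual objective. The function names, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_view_alignment_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's formula).

    view_a[i] and view_b[i] are embeddings of the same clip seen from two
    different camera views; every other pairing is treated as a negative.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates in view_b.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Negative log-probability of picking the matching clip.
        total += log_sum_exp - logits[i]
    return total / len(a)


# Toy check: matching clips across views give a lower loss than mismatched ones.
clips_view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_view_alignment_loss(clips_view_a, [[1.0, 0.0], [0.0, 1.0]])
shuffled = cross_view_alignment_loss(clips_view_a, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each clip's two views share an embedding and large when the pairing is swapped, which is the behavior a view-agnostic representation is meant to encourage.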

Paper Structure

This paper contains 21 sections, 11 equations, 11 figures, and 10 tables.

Figures (11)

  • Figure 1: Humans can learn egocentric hand-object interactions (Ego-HOI) knowledge by observing extensive third-person videos. Following this intuition, we utilize multi-view third-person videos for learning view-agnostic representation that can be transferred to egocentric view.
  • Figure 2: Illustration of our prompt-oriented view-agnostic learning framework, which trains a model through two optimization sub-tasks and one optional sub-task: (1) prompt-based action understanding, which incorporates interactive masking prompts into frames to pre-train the entire model on third-person videos; (2) view-agnostic prompt tuning, where only view-aware prompts are fine-tuned through cross-view alignment and cross-entropy loss; and (3) egocentric fine-tuning, where the model is optionally fine-tuned on limited egocentric videos.
  • Figure 3: Illustration of per-class analysis comparing POV and MViT-3rd. The green and red lines indicate the improvement and degradation of POV relative to MViT-3rd, respectively.
  • Figure 4: Illustration of t-SNE analysis of egocentric features, where each color represents one unique class.
  • Figure 5: Illustration of egocentric prediction examples.
  • ...and 6 more figures
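
The three sub-tasks in the Figure 2 caption can be read as a training schedule that changes which parameters are updated at each stage. The sketch below encodes only that reading. The parameter-group names are hypothetical, and two points are assumptions the caption does not settle: that view-aware prompts are not yet trained in the first stage, and that the optional egocentric stage updates every group.

```python
from dataclasses import dataclass

# Hypothetical parameter-group names, used only for illustration.
ALL_GROUPS = ("backbone", "classifier_head", "view_aware_prompts")


@dataclass(frozen=True)
class Stage:
    name: str
    video_source: str
    trainable: tuple
    optional: bool = False


SCHEDULE = (
    # (1) Pre-train the entire model on third-person videos with interactive
    #     masking prompts. Assumption: view-aware prompts are not trained yet.
    Stage("prompt_based_action_understanding", "third_person",
          ("backbone", "classifier_head")),
    # (2) Only the view-aware prompts are tuned; everything else stays frozen.
    Stage("view_agnostic_prompt_tuning", "third_person",
          ("view_aware_prompts",)),
    # (3) Optional fine-tuning on limited egocentric videos.
    #     Assumption: all groups are updated in this stage.
    Stage("egocentric_fine_tuning", "egocentric", ALL_GROUPS, optional=True),
)


def frozen_groups(stage):
    """Parameter groups that receive no gradient updates in this stage."""
    return tuple(g for g in ALL_GROUPS if g not in stage.trainable)


def plan(include_optional):
    """Stages to run, with or without the optional egocentric stage."""
    return [s for s in SCHEDULE if include_optional or not s.optional]


for stage in plan(include_optional=True):
    print(stage.name, "| frozen:", frozen_groups(stage))
```

Keeping the schedule as data makes the key property of the second stage easy to check: the backbone is frozen and only the prompt tokens are updated.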