POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World
Boshen Xu, Sipeng Zheng, Qin Jin
TL;DR
The paper tackles cross-view egocentric hand-object interaction (Ego-HOI) recognition by leveraging abundant multi-view third-person videos to learn view-agnostic representations. It introduces POV, a prompt-oriented framework that combines frame-level interactive masking prompts (IMP) and token-level view-aware prompts within a vision transformer. POV is trained with two core tasks, prompt-based action understanding and view-agnostic prompt tuning, followed by an optional egocentric fine-tuning stage. The authors define Ego-HOI-XView benchmarks on Assembly101 and H2O to evaluate cross-view transfer, and demonstrate that prompt tuning with a frozen backbone generalizes strongly to egocentric views, outperforming state-of-the-art zero-shot and few-shot baselines with notable efficiency benefits. Key contributions include the design of frame- and token-level prompts, a two-stage training strategy with a cross-view alignment (CVA) objective, and comprehensive ablations showing the importance of CVA and IMP for fine-grained HOI recognition. The work offers a practical pathway to scalable Ego-HOI understanding with limited target-view data, enabling broader deployment in robotics and AR contexts.
Abstract
We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representations from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework, which enables this view adaptation with few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representations. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization. Our code is available at https://github.com/xuboshen/pov_acmmm2023.
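To make the idea of token-level view-aware prompts concrete, the following is a minimal, dependency-free sketch and not the authors' implementation (the linked repository holds the actual code). It assumes that each camera view owns a small set of learnable prompt tokens that are prepended to a frame's patch tokens before they enter a frozen transformer backbone. The class name `ViewAwarePrompts`, the view names, and the sizes are all hypothetical choices made for illustration.

```python
import random
from typing import Dict, List

Token = List[float]


class ViewAwarePrompts:
    """Hypothetical store of learnable prompt tokens, one small set per camera view."""

    def __init__(self, views: List[str], num_prompts: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Assumption: prompts start as small random vectors and would be the
        # only parameters updated during prompt tuning (the backbone stays frozen).
        self.table: Dict[str, List[Token]] = {
            view: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]
            for view in views
        }

    def prepend(self, view: str, patch_tokens: List[Token]) -> List[Token]:
        # Build a new sequence: [view prompts..., patch tokens...].
        # The patch tokens themselves are left untouched.
        return self.table[view] + patch_tokens

    def num_trainable(self) -> int:
        # Total number of scalar values held in the prompt table.
        return sum(len(token) for prompts in self.table.values() for token in prompts)


# Hypothetical setup: two third-person views plus the egocentric view.
prompts = ViewAwarePrompts(
    views=["third_person_1", "third_person_2", "ego"], num_prompts=4, dim=8
)
patch_tokens = [[0.0] * 8 for _ in range(16)]  # stand-in for 16 patch embeddings
sequence = prompts.prepend("ego", patch_tokens)
```

In this toy setup the prompt table holds only 3 × 4 × 8 scalars, which illustrates why tuning prompts against a frozen backbone can be far cheaper than fine-tuning the whole model.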
