What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Yuxiao Wang; Yu Lei; Wolin Liang; Weiying Xue; Zhenao Wei; Nan Zhuang; Qi Liu

What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu

Abstract

People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider \textbf{what} action is occurring and \textbf{where} it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication.

What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Abstract

What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Abstract

Paper Structure

Table of Contents

Figures (4)