What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu

Abstract

People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication.

Paper Structure

This paper contains 17 sections, 18 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The same object (e.g., cake or cup) can imply different actions depending on contact regions. For example, "eating" involves both hand and head, while "holding" involves only the hand. To resolve this ambiguity, we propose joint modeling of What (action & object) and Where (contact body part).
  • Figure 2: The overall workflow of PaIR-Net. It comprises three branches: CPAM for multi-label body part contact prediction (upper part of the figure), PGCS for contact region segmentation (middle part of the figure), and IIM for detecting human-object pairs and identifying interaction categories (lower part of the figure). To facilitate effective collaboration between contact understanding and action recognition, we design two key modules: the H-O RoI Enhancer and the Mask-Guided RoI Feature.
  • Figure 3: (a) The H-O RoI Enhancer module. It computes the minimum enclosing rectangle of the human and object bounding boxes, and enhances the responses of the feature $\bm{F}_B$ within this region. (b) The structure of the Mask-Guided RoI Feature module. It uses $\bm{S}$ to extract the minimum enclosing contact region, crops the corresponding region from $\bm{F}_B$, and generates the contact feature encoding $\bm{F}_M$ through GAP and FC layers. Finally, $\bm{F}_M$ is fused with $\bm{D}_a$ to assist action classification.
  • Figure 4: Visualization results. Red and green bounding boxes represent the human and object, respectively. Blue text indicates the action category, and green text indicates the object category.
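
The geometric steps named in the Figure 3 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`union_box`, `mask_box`, `crop_gap`), the `(x1, y1, x2, y2)` box convention with exclusive right and bottom edges, and the nested-list feature layout `[channel][row][column]` are all assumptions made for illustration. The caption does not say how responses are enhanced inside the enclosing rectangle, nor how the FC layer or the fusion with $\bm{D}_a$ is configured, so those parts are left out.

```python
from typing import List, Optional, Tuple

# Assumed convention: (x1, y1, x2, y2) with x2 and y2 exclusive.
Box = Tuple[int, int, int, int]


def union_box(human: Box, obj: Box) -> Box:
    """Minimum enclosing rectangle of a human box and an object box."""
    return (min(human[0], obj[0]), min(human[1], obj[1]),
            max(human[2], obj[2]), max(human[3], obj[3]))


def mask_box(mask: List[List[int]]) -> Optional[Box]:
    """Minimum enclosing rectangle of the nonzero pixels of a binary mask.

    Returns None when the mask contains no contact pixels.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop_gap(features: List[List[List[float]]], box: Box) -> List[float]:
    """Crop every channel to `box`, then average it (global average pooling)."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    return [sum(sum(row[x1:x2]) for row in channel[y1:y2]) / area
            for channel in features]


if __name__ == "__main__":
    # (a) Enclosing rectangle for one hypothetical human-object pair.
    print(union_box((2, 1, 6, 9), (5, 4, 10, 7)))

    # (b) Enclosing rectangle of a toy contact mask, then crop + pooling
    # on a toy two-channel feature map of the same spatial size.
    toy_mask = [[0, 0, 0, 0, 0],
                [0, 1, 1, 0, 0],
                [0, 0, 1, 0, 0],
                [0, 0, 0, 0, 0]]
    toy_features = [
        [[float(y * 5 + x) for x in range(5)] for y in range(4)],
        [[1.0] * 5 for _ in range(4)],
    ]
    region = mask_box(toy_mask)
    if region is not None:
        print(region, crop_gap(toy_features, region))
```

In a real model the pooled vector would pass through a learned FC layer to produce $\bm{F}_M$; the sketch stops at the pooling step because the layer sizes are not given in the caption.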