Contextualized Representation Learning for Effective Human-Object Interaction Detection

Zhehao Li; Yucheng Qian; Chong Wang; Yinghao Lu; Zhihao Yang; Jiafei Wu

TL;DR

HOI detection often struggles with incomplete context, especially for tool-mediated interactions. The proposed Contextualized Representation Learning (CRL) integrates affordance-guided reasoning with contextualized prompts, leveraging multivariate relationships such as the <human, tool, object> triplet and attention-based language-vision alignment. It combines Multivariate Relationship Modeling (MRM) with Contextualized Prompt Learning (CPL), fusing instance-level visual cues into prompts and prompts into HOI reasoning, and reports superior results on HICO-Det and V-COCO in most scenarios. The approach highlights the practical value of incorporating tool affordances and regional visual context into HOI models, and the source code is publicly available for reproducibility.

Abstract

Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce Contextualized Representation Learning (CRL), a framework that integrates affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We extend the conventional HOI detection framework beyond simple human-object pairs to multivariate relationships involving auxiliary entities such as tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through <human, tool, object> triplet structures. This enables our model to identify tool-dependent interactions such as 'filling'. Furthermore, the learnable prompt is enriched with instance categories and then integrated with contextual visual features using an attention mechanism, which aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. The source code is available at https://github.com/lzzhhh1019/CRL.
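
To make the attention-based fusion of prompts and visual features more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that prompt tokens act as queries and that global and regional visual features act as keys and values in a single-head scaled dot-product attention with a residual connection. The function names (`attention_weights`, `contextualize_prompt`), the 4-dimensional toy embeddings, and the omission of learned projection matrices are all illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def contextualize_prompt(prompt_tokens, visual_feats):
    """Fuse visual context into each prompt token via cross-attention.

    Each prompt token is a query; visual features serve as keys and values.
    A residual connection keeps the original token content.
    (Hypothetical helper: learned projections are omitted for brevity.)
    """
    fused = []
    for token in prompt_tokens:
        weights = attention_weights(token, visual_feats)
        context = [sum(w * feat[d] for w, feat in zip(weights, visual_feats))
                   for d in range(len(token))]
        fused.append([t + c for t, c in zip(token, context)])
    return fused

# Toy 4-d embeddings (made-up numbers, for illustration only).
prompt_tokens = [
    [0.1, 0.2, 0.0, 0.3],   # learnable context token
    [1.0, 0.0, 0.0, 0.0],   # category token, e.g. "person"
    [0.0, 1.0, 0.0, 0.0],   # category token, e.g. "cup"
]
visual_feats = [
    [0.5, 0.5, 0.5, 0.5],   # global image feature
    [2.0, 0.0, 0.1, 0.0],   # regional feature (human box)
    [0.0, 2.0, 0.0, 0.1],   # regional feature (object box)
]

fused_prompt = contextualize_prompt(prompt_tokens, visual_feats)
person_weights = attention_weights(prompt_tokens[1], visual_feats)
```

In this toy setup, the "person" category token places its largest attention weight on the human-box feature, which illustrates how a category-enriched prompt could pick up region-specific visual context. A real model would presumably use learned query, key, and value projections and multiple heads.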
