Contextualized Representation Learning for Effective Human-Object Interaction Detection

Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang, Jiafei Wu

TL;DR

HOI detection often struggles with incomplete context, especially for tool-mediated interactions. The proposed Contextualized Representation Learning (CRL) integrates affordance-guided reasoning with contextualized prompts, leveraging multivariate relationships such as the <human, tool, object> triplet and attention-based language-vision alignment. It combines Multivariate Relationship Modeling (MRM) with Contextualized Prompt Learning (CPL): instance categories and visual cues are fused into a learnable prompt, and the resulting prompt feeds into HOI reasoning. The method reports superior performance on HICO-Det and V-COCO in most scenarios. The approach highlights the practical value of incorporating tool affordances and regional visual context into HOI models, and the source code is publicly available for reproducibility.

Abstract

Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning (CRL) framework that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures <human, tool, object>. This enables our model to identify tool-dependent interactions such as 'filling'. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. The source code is available at https://github.com/lzzhhh1019/CRL.
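
To make the triplet idea concrete, the following is a minimal sketch of how candidate <human, tool, object> index triplets could be enumerated from a detector's output. It is not the authors' implementation: the `Detection` class, the `build_token_sets` function, and the rule that any non-human third instance counts as a candidate tool are all assumptions made for illustration. The actual method operates on regional features and learns which candidates matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    idx: int    # position in the (hypothetical) detector output
    label: str  # predicted category, e.g. "person", "bottle", "cup"


def build_token_sets(detections, human_label="person"):
    """Enumerate unary, binary, and ternary index sets (illustrative only)."""
    humans = [d for d in detections if d.label == human_label]
    # Unary: one entry per detected instance.
    unary = [(d.idx,) for d in detections]
    # Binary: every human paired with every other instance.
    binary = [(h.idx, o.idx) for h in humans for o in detections if o.idx != h.idx]
    # Ternary: <human, tool, object>, where the tool is any third,
    # non-human instance (an assumption of this sketch).
    ternary = [
        (h.idx, t.idx, o.idx)
        for h in humans
        for o in detections
        if o.idx != h.idx
        for t in detections
        if t.idx not in (h.idx, o.idx) and t.label != human_label
    ]
    return unary, binary, ternary


dets = [Detection(0, "person"), Detection(1, "bottle"), Detection(2, "cup")]
unary, binary, ternary = build_token_sets(dets)
# The candidate (0, 1, 2) corresponds to <person, bottle, cup>,
# the configuration behind a tool-dependent interaction such as 'filling'.
print(ternary)
```

Exhaustive enumeration grows cubically with the number of detections, so a practical system would likely prune candidates; the sketch only shows the structure being modeled.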

Paper Structure

This paper contains 17 sections, 19 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Contextualized representations in HOI. (a) Tool affordances (e.g., bottle's pourable) help distinguish complex interactions (e.g., human fill cup) from direct human–object relations (e.g., human hold cup). (b) Contextualized alignment with instance categories (e.g., cup) and corresponding visual features narrows down the potential actions to relevant ones (e.g., fill, hold).
  • Figure 2: Overall architecture of our Contextualized Representation Learning Network, consisting of Multivariate Relationship Modeling (MRM) and Contextualized Prompt Learning (CPL). MRM constructs unary, binary and ternary token sets from regional features to model HOIs. CPL builds a category-aware learnable prompt, fused with diverse contextual visual features. Their combined outputs are utilized for interaction prediction. The structure of the binary/ternary/contextual decoder is shown in the bottom right.
  • Figure 3: Comparison of per-category accuracy between CRL-B and PViC (Zhang et al., 2023) on HICO-Det-HTO.
  • Figure 4: Comparison of model performance with respect to learnable parameters and training epochs.
  • Figure 5: Qualitative results on the HICO-Det test set with a fine-tuned DETR-R50 as the object detector. Humans and objects are marked with blue and green bounding boxes, respectively. The textual annotation below each image is the ground truth. N/I denotes no interaction.
  • ...and 1 more figure
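
The Figure 2 caption describes a learnable prompt being fused with contextual visual features, and the abstract states that this fusion uses an attention mechanism. The sketch below shows the general technique in its simplest form: single-head scaled dot-product cross-attention with prompt tokens as queries and visual features as keys and values, followed by a residual addition. It is an assumption-laden illustration rather than the paper's contextual decoder, which presumably uses learned projections and additional layers. The function names `softmax` and `cross_attend` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(prompt_tokens, visual_feats):
    """Fuse each prompt token with visual context (no learned weights)."""
    d = len(prompt_tokens[0])
    fused = []
    for q in prompt_tokens:
        # Similarity of this prompt token to every visual feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in visual_feats
        ]
        weights = softmax(scores)
        # Attention-weighted average of the visual features.
        context = [
            sum(w * v[j] for w, v in zip(weights, visual_feats))
            for j in range(d)
        ]
        # Residual connection keeps the original prompt content.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


prompt = [[1.0, 0.0]]              # one toy prompt token
visual = [[1.0, 0.0], [0.0, 1.0]]  # two toy regional features
print(cross_attend(prompt, visual))
```

In this toy input the prompt token is more similar to the first visual feature, so that feature receives the larger attention weight and dominates the fused representation.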