
Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection

Weibo Jiang, Weihong Ren, Jiandong Tian, Liangqiong Qu, Zhiyong Wang, Honghai Liu

TL;DR

This work tackles HOI detection by addressing action ambiguity through Self- and Cross-Triplet Correlations (SCTC). It models each candidate HOI triplet as a graph to capture self-triplet relations and constructs a triplet-level graph to exploit dependencies across proposals, with instance, semantic, and layout relations informing the cross-triplet edges. A CLIP-based knowledge distillation mechanism enhances the interaction feature, guiding the model with vision-language semantics. Empirical results on HICO-DET and V-COCO show that SCTC achieves state-of-the-art performance, with ablations confirming the effectiveness of Self-Triplet Aggregation (STA), Cross-Triplet Dependency (CTD), and CLIP knowledge distillation (KD) in reducing action ambiguity and improving HOI reasoning. The approach offers a framework for integrating multi-level relationships in HOI detection and demonstrates the practical impact of vision-language supervision in structured scene understanding.

Abstract

Human-Object Interaction (HOI) detection plays a vital role in scene understanding; it aims to predict HOI triplets of the form <human, object, action>. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and then fuse them to directly predict HOI triplets. However, most of these methods focus on self-triplet aggregation and ignore potential cross-triplet dependencies, which leads to ambiguous action predictions. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, we regard each triplet proposal as a graph in which Human and Object are nodes and Action is the edge, and use this graph to aggregate self-triplet correlations. We also explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. In addition, we leverage the CLIP model to help SCTC obtain interaction-aware features through knowledge distillation, which provides useful action cues for HOI detection. Extensive experiments on the HICO-DET and V-COCO datasets verify the effectiveness of the proposed SCTC.
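
The cross-triplet idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that instance-level, semantic-level, and layout-level relations are considered jointly. Everything below is an assumption made for illustration, including the `Triplet` fields, the hand-written relation scores (shared instance, same object class, IoU of the human-object union boxes), the equal-weight average in `edge_weight`, and the single residual message-passing step in `cross_triplet_update`. A learned model would presumably replace these hand-set scores with trained functions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Triplet:
    """Hypothetical container for one human-object proposal."""
    human_id: int
    object_id: int
    object_class: str
    human_box: Box
    object_box: Box
    feature: List[float]  # interaction feature of this proposal


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def union_box(t: Triplet) -> Box:
    """Smallest box enclosing both the human and the object."""
    h, o = t.human_box, t.object_box
    return (min(h[0], o[0]), min(h[1], o[1]), max(h[2], o[2]), max(h[3], o[3]))


def edge_weight(a: Triplet, b: Triplet) -> float:
    """Assumed edge score: equal-weight mean of three relation cues."""
    # Instance-level: the two proposals share a human or an object.
    instance = 1.0 if (a.human_id == b.human_id or a.object_id == b.object_id) else 0.0
    # Semantic-level: the two objects belong to the same class.
    semantic = 1.0 if a.object_class == b.object_class else 0.0
    # Layout-level: spatial overlap of the two human-object regions.
    layout = iou(union_box(a), union_box(b))
    return (instance + semantic + layout) / 3.0


def cross_triplet_update(triplets: List[Triplet]) -> List[List[float]]:
    """One round of weighted message passing over the triplet-level graph."""
    updated = []
    for i, t in enumerate(triplets):
        others = [o for j, o in enumerate(triplets) if j != i]
        weights = [edge_weight(t, o) for o in others]
        total = sum(weights)
        if total == 0:
            # No related proposals: keep the feature unchanged.
            updated.append(list(t.feature))
            continue
        message = [
            sum(w * o.feature[k] for w, o in zip(weights, others)) / total
            for k in range(len(t.feature))
        ]
        # Residual update: original feature plus the aggregated context.
        updated.append([x + m for x, m in zip(t.feature, message)])
    return updated
```

In this sketch, two proposals that share the same person receive a non-zero edge even when their objects differ, so the context of one interaction (for example, a person riding a bicycle) can inform the prediction for another proposal involving the same person.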

Paper Structure

This paper contains 24 sections, 10 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Architecture comparison of different HOI detection pipelines. (a) Traditional HOI detection methods explore multiple features and fuse them through self-triplet fusion (e.g., body-part attention); (b) our SCTC jointly explores Self- and Cross-Triplet Correlations, aiming to eliminate action ambiguity and promote scene understanding.
  • Figure 2: The overall pipeline of SCTC. For an input image, the instance-aware module first extracts instance-level features (appearance and semantics). Then, the interaction-aware module matches human-object pairs and fuses their features to generate the interaction feature. Next, Self-Triplet Aggregation (STA) explores self-triplet attention, while Cross-Triplet Dependency (CTD) builds connections across different HOI triplets. Finally, the action decoder predicts the HOI triplets. In addition, Knowledge Distillation (KD) transfers text embeddings from CLIP to the interaction feature.
  • Figure 3: Generation of the edges in CTD, where the circled M symbol (Ⓜ) indicates an MLP.
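
The knowledge distillation step mentioned in the Figure 2 caption can be sketched as follows. The caption only says that CLIP text embeddings are transferred to the interaction feature; the loss form is not given here. The function below therefore assumes a cosine-distance objective between each interaction feature (student) and the CLIP text embedding of its action label (teacher). The names `l2_normalize` and `cosine_distillation_loss` are hypothetical, and the embeddings are plain lists of floats standing in for real CLIP outputs.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_distillation_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) over paired student/teacher vectors.

    Assumed objective: pull each interaction feature towards the direction
    of the text embedding of its action label.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must contain the same number of vectors")
    if not student:
        return 0.0
    total = 0.0
    for s, t in zip(student, teacher):
        s_n, t_n = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s_n, t_n))
    return total / len(student)
```

Under this assumed objective the loss is 0 when a feature points in the same direction as its teacher embedding and 1 when the two are orthogonal, so it depends only on direction and not on feature magnitude.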