Explicit Multimodal Graph Modeling for Human-Object Interaction Detection

Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang

TL;DR

The paper tackles HOI detection by shifting from Transformer-only attention to explicit relational reasoning through a multimodal graph network. It introduces MGNM, a two-stage HOI detector that builds explicit human-object pair graphs and applies a four-stage Multi-level Feature Interaction (MFI) to fuse low-level geometry with high-level visual and language cues via CLIP, followed by a Transformer decoder to predict HOI triplets $\langle \text{human}, \text{action}, \text{object} \rangle$. Its interaction-centric prompts and multimodal fusion enable rich cross-modal information propagation, yielding state-of-the-art results on HICO-DET and V-COCO and improving rare/non-rare class balance when combined with strong detectors. The work demonstrates the value of explicit multimodal graph modeling for robust HOI understanding and provides practical guidance for mitigating long-tail bias in HOI benchmarks.

Abstract

Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level visual and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art (SOTA) performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.
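
To make "explicitly model the relationships between human-object pairs" concrete, the following is a minimal sketch of one way a pair graph could be built from detector outputs. It is not the paper's implementation: the function name `build_pair_graph`, the dictionary layout of a detection, and the rule that two pair nodes are connected when they share an instance are all assumptions made for illustration.

```python
from itertools import product

def build_pair_graph(detections, human_label="person"):
    """Enumerate human-object pair nodes and connect pairs that share an instance.

    Hypothetical helper: the edge rule here is an illustrative assumption,
    not the matching mechanism described in the paper.
    """
    humans = [i for i, d in enumerate(detections) if d["label"] == human_label]
    # Each node is an ordered (human index, object index) pair; a human may
    # also act as the object of another human, but never of itself.
    pairs = [(h, o) for h, o in product(humans, range(len(detections))) if h != o]
    # Undirected edges between pair nodes that have an instance in common.
    edges = [
        (a, b)
        for a in range(len(pairs))
        for b in range(a + 1, len(pairs))
        if set(pairs[a]) & set(pairs[b])
    ]
    return pairs, edges

detections = [{"label": "person"}, {"label": "bicycle"}, {"label": "person"}]
pairs, edges = build_pair_graph(detections)
```

With two people and one bicycle, this yields four pair nodes, each of which becomes a unit of relational reasoning in the graph rather than an implicit by-product of attention.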

Paper Structure

This paper contains 23 sections, 9 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of existing Transformer-based methods and our GNN-based method. Transformer-based methods typically perform feature extraction prior to matching without explicit relational modeling, making it difficult for them to identify complex interactions. In contrast, our GNN-based method first constructs human-object pairs via a general matching mechanism and then applies a multi-level feature interaction mechanism to enable explicit GNN-based relational reasoning. H, O, and f denote human, object, and multimodal features, respectively.
  • Figure 2: Overview of our MGNM framework. The core of our proposed MGNM framework is a four-stage multimodal graph network. (1) Spatial Stage: For each candidate pair, low-level spatial features derived from their bounding boxes and 3D location prior are used to initialize pairwise representations. After that, these spatial features are utilized to construct a weighted matrix that regulates feature interactions in the subsequent stages. (2) Visual Stage: High-level visual semantic features are extracted using the CLIP image encoder. These visual cues further enrich the interactions among human-object pairs. (3) Textual Stage: The CLIP text encoder is adopted to obtain semantic cues for the corresponding human and object instances with the designed object-centric prompt. (4) Interaction Stage: In the final stage, collaborating with the interaction-centric prompt, the model captures high-level interaction features between human-object pairs, facilitating more effective relational reasoning within the graph structure. Finally, the refined pairwise representations are utilized as queries in the Transformer decoder to predict the final HOI triplets.
  • Figure 3: Qualitative results on the HICO-DET dataset. Panels 1a-1c illustrate successful predictions across a variety of challenging scenarios. Panel 3a presents a representative failure case, and panels 3b and 3c visualize its corresponding attention maps.
  • Figure 4: Comparison of our method and two related GNN-based methods on rare and non-rare classes.
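
The four stages described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' method: the 4-dimensional geometry feature, the softmax over negative distances used as the weight matrix, and the additive update in `propagate` are all hypothetical choices, and the visual, textual, and interaction cues are passed in as plain vectors standing in for CLIP-derived features.

```python
import math

def spatial_feature(h_box, o_box):
    """Toy 4-d geometry for one pair: center offset and log size ratios."""
    hx, hy = (h_box[0] + h_box[2]) / 2, (h_box[1] + h_box[3]) / 2
    ox, oy = (o_box[0] + o_box[2]) / 2, (o_box[1] + o_box[3]) / 2
    hw, hh = h_box[2] - h_box[0], h_box[3] - h_box[1]
    ow, oh = o_box[2] - o_box[0], o_box[3] - o_box[1]
    return [ox - hx, oy - hy, math.log(ow / hw), math.log(oh / hh)]

def spatial_weights(spatial):
    """Row-wise softmax over negative distances between pair geometries (assumed form)."""
    weights = []
    for a in spatial:
        scores = [-math.dist(a, b) for b in spatial]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def propagate(state, cues, weights):
    """One stage: each pair adds a weighted sum of all pairs' cues to its state."""
    dim = len(state[0])
    return [
        [state[i][d] + sum(weights[i][j] * cues[j][d] for j in range(len(cues)))
         for d in range(dim)]
        for i in range(len(state))
    ]

def run_stages(pair_boxes, visual, textual, interaction):
    # Stage 1 (spatial): initialize pair states and build the weight matrix.
    spatial = [spatial_feature(h, o) for h, o in pair_boxes]
    weights = spatial_weights(spatial)
    state = spatial
    # Stages 2-4 (visual, textual, interaction): weighted propagation of cues.
    for cues in (visual, textual, interaction):
        state = propagate(state, cues, weights)
    return state, weights  # the final states would serve as decoder queries

# Boxes are (x1, y1, x2, y2) in normalized coordinates; two humans, one shared object.
pair_boxes = [((0.1, 0.1, 0.4, 0.9), (0.3, 0.5, 0.6, 0.9)),
              ((0.5, 0.1, 0.8, 0.9), (0.3, 0.5, 0.6, 0.9))]
zeros = [[0.0] * 4 for _ in pair_boxes]
ones = [[1.0] * 4 for _ in pair_boxes]
queries, weights = run_stages(pair_boxes, ones, zeros, zeros)
```

The point of the sketch is the control flow: geometry is computed once, turned into a matrix that gates how much each pair listens to every other pair, and then reused across the later stages as higher-level cues are folded in.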