RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection

Jihwan Park, Chanhyeong Yang, Jinyoung Park, Taehoon Song, Hyunwoo J. Kim

Abstract

Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for the reasoning process and promotes locality-aware interaction learning. By learning localized interaction cues, our module distinguishes humans, objects, and their interactions, enabling direct transfer from image-level interaction reasoning to precise and efficient instance-level reasoning without additional training. Our extensive experiments and analyses demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models. Our code is available at https://github.com/mlvlab/RegFormer.
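To make the cost the abstract mentions concrete: the detector-based pipeline scores every candidate human–object pair, so its workload grows with the product of the instance counts. The sketch below is our own illustration of this enumeration, not code from the paper.

```python
# Illustration (not from the paper) of the bottleneck in detector-based
# weakly-supervised HOI pipelines: every detected human is paired with every
# detected object, so pairwise reasoning scales as O(N_h * N_o).
from itertools import product

def enumerate_candidate_pairs(human_boxes, object_boxes):
    """All human-object box pairs a pairwise interaction head must score."""
    return list(product(human_boxes, object_boxes))

humans = [(10, 20, 50, 120), (200, 30, 260, 150)]                  # (x1, y1, x2, y2)
objects = [(60, 40, 90, 80), (300, 50, 340, 100), (5, 5, 30, 30)]
print(len(enumerate_candidate_pairs(humans, objects)))             # 2 x 3 = 6 pairs
```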

Figures (7)

  • Figure 1: Comparison of weakly-supervised HOI detection frameworks. (A) As the number of instance pairs increases, RegFormer shows only a marginal increase in inference time, whereas the ML-Decoder becomes significantly slower. (B) In addition, RegFormer effectively suppresses non-interactive human–object pairs, producing fewer false positives.
  • Figure 2: Overall framework of RegFormer. RegFormer unifies image-level and instance-level reasoning within a single framework by learning to capture spatial cues for interaction reasoning using only image-level labels. During training, the pairwise instance encoder constructs a human–object (HO) query $q^{\text{ho}}$ by aggregating spatial features $x$ according to the patch importance score $\alpha(p)$ associated with each human and object class. The resulting HO queries are processed by the interaction decoder, which outputs the interaction classification scores $\hat{s}^{\text{a}}$. To further support locality-aware learning, we introduce a spatially aggregated interactiveness score $r$, which acts as a gating signal for the interaction score $\hat{s}^\text{a}$ and receives explicit supervision (see the training-path sketch after this list). During instance-level HOI inference, given detected human and object instances from the object detector, a region-aware mask $m(p)$ is applied to constrain both the HO queries and the interactiveness scores within their corresponding regions, enabling HOI detection without additional training (see the inference-path sketch after this list).
  • Figure 3: Visualization of interactiveness score in HOI detection. The top row shows the interactiveness score for the human (red box), and the bottom row shows the score for the object (blue box). In the first row, the masked global interactiveness (0.01) corrects the human's inflated local interactiveness (0.768), which arises from strong semantic alignment between the human and the patch, reducing the pairwise score of the non-interactive region (red box) to 0.008.
  • Figure 4: Qualitative results on patch importance score. We visualize the human patch importance score, $\alpha^\text{h}(p)$.
  • Figure 5: Efficiency comparison. (a) ML-Decoder baseline, (b) adding our HO$\rightarrow$I reasoning, (c) (b) + spatially grounded pairwise query, (d) (b) + interactiveness scoring, (e) full RegFormer with all components.
  • ...and 2 more figures
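
To make the training path described in the Figure 2 caption concrete, here is a minimal PyTorch sketch: patch importance scores $\alpha(p)$ aggregate spatial features into an HO query, a decoder produces interaction scores $\hat{s}^{\text{a}}$, and an interactiveness score $r$ gates them. All shapes, the additive fusion of the human and object aggregations, the linear stand-in for the interaction decoder, and the particular interactiveness aggregation are our assumptions, not the authors' implementation.

```python
# Minimal sketch of the Figure 2 training path, under our own assumptions.
import torch

P, D, A = 196, 256, 117            # patches, feature dim, interaction classes
patch_feats = torch.randn(P, D)    # spatial features x, one vector per patch

alpha_h = torch.softmax(torch.randn(P), dim=0)   # human patch importance alpha^h(p)
alpha_o = torch.softmax(torch.randn(P), dim=0)   # object patch importance alpha^o(p)

q_h = alpha_h @ patch_feats        # (D,) importance-weighted human aggregation
q_o = alpha_o @ patch_feats        # (D,) importance-weighted object aggregation
q_ho = q_h + q_o                   # HO query q^ho (additive fusion is an assumption)

decoder = torch.nn.Linear(D, A)    # stand-in for the interaction decoder
s_a = decoder(q_ho).sigmoid()      # interaction classification scores \hat{s}^a

r = torch.sigmoid((alpha_h * alpha_o).sum())   # illustrative interactiveness score r
s_final = r * s_a                  # gated scores matched against image-level labels
print(s_final.shape)               # torch.Size([117])
```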
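
The instance-level inference path in Figures 2 and 3 can be sketched similarly: a region-aware mask $m(p)$ derived from a detected box restricts the patch importance (and hence the HO query and interactiveness) to that region, with no additional training. The helper `box_to_patch_mask`, the grid size, and the multiplicative combination of local and masked global interactiveness are all our assumptions; the product is merely consistent with the numbers in the Figure 3 caption ($0.768 \times 0.01 \approx 0.008$), and the paper's exact combination rule may differ.

```python
# Minimal sketch of the Figure 2/3 inference path, under our own assumptions.
import torch

def box_to_patch_mask(box, grid=14, img=224):
    """Binary region-aware mask m(p) over a grid x grid patch layout for one box."""
    x1, y1, x2, y2 = (int(v * grid / img) for v in box)
    m = torch.zeros(grid, grid)
    m[y1:y2 + 1, x1:x2 + 1] = 1.0
    return m.flatten()                             # (P,), with P = grid * grid

P = 14 * 14
alpha_h = torch.softmax(torch.randn(P), dim=0)     # human patch importance
human_box = (20, 30, 120, 200)                     # (x1, y1, x2, y2) from the detector

m = box_to_patch_mask(human_box)
alpha_region = alpha_h * m                         # importance confined to the box
alpha_region = alpha_region / alpha_region.sum().clamp(min=1e-6)
print(int(m.sum()), "patches fall inside the human box")

# Figure 3 reports a local interactiveness of 0.768 corrected by a masked
# global interactiveness of 0.01; a simple product reproduces the reported
# pairwise score (0.768 * 0.01 ~= 0.008), though this rule is an assumption.
local_r, masked_global_r = 0.768, 0.01
print(round(local_r * masked_global_r, 3))         # 0.008
```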