A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

Lijun Zhang, Wei Suo, Peng Wang, Yanning Zhang

TL;DR

Rare human-object interaction (HOI) detection is hindered by long-tail data distributions and by a domain gap between AI-generated and real HOI data. The authors propose CEFA, a plug-and-play framework that bridges this gap with two components: an Instance Feature Alignment Module, which aligns instance-level HOI features, and a Context Enhancement Module, which enriches contextual cues through context-aware image reconstruction. The Prototype Instance Alignment Module (PIAM) uses a graph-based prototype approach to aggregate instance information, while the context module reconstructs masked generated images under guidance from original-image features, supervised by a context loss. On HICO-Det and V-COCO, CEFA yields consistent gains on rare categories when combined with multiple HOI baselines, and ablations indicate that the two modules contribute complementary improvements, which supports using generated data for HOI detection in imbalanced settings.

Abstract

Human-object interaction (HOI) detection aims to capture human-object pairs in images together with their corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, owing to the natural bias of real-world data, existing methods mostly struggle with rare human-object pairs, which leads to sub-optimal results. With the recent development of generative models, a straightforward approach is to construct a more balanced dataset from a group of supplementary generated samples. Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset does not significantly boost performance. To alleviate this problem, we present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA), which aligns the generated data with the original data at the feature level and bridges the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On the one hand, considering the crucial role of human-object pair information in HOI tasks, the feature alignment module aligns the human-object pairs by aggregating instance information. On the other hand, to mitigate the loss of important context information caused by traditional discriminator-style alignment, we employ a context-enhanced image reconstruction module to improve the model's ability to learn contextual cues. Extensive experiments show that our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories. Code: https://github.com/LijunZhang01/CEFA

Paper Structure (23 sections, 13 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) Performance comparison on rare categories of the HICO-Det dataset. "Ours" refers to our CEFA method. "Baseline" represents traditional HOI models such as CDN, GEN-VLKT and HOICLIP. "Baseline+Data" indicates the approach of merely merging the generated data with the original dataset to train the baseline model. (b) t-SNE visualization of original-image features and generated-image features extracted with the HOICLIP encoder. Red represents the generated images, and blue represents the original images. (c) Visualization of features extracted by our model.
  • Figure 2: The architecture of CEFA. It consists of two parts: (I) Instance Feature Alignment Module (orange part): this module aligns the human-object pairs by aggregating instance information. (II) Context Enhancement Module (blue part): this module uses a context-enhanced image reconstruction module to improve the model's ability to learn contextual cues. The gray part in the figure represents the baseline HOI model. Further details about PIAM can be found in Fig. 3.
  • Figure 3: The architecture of the Prototype Instance Alignment Module (PIAM). It consists of three steps. First, prototypes are selected; next, a graph is constructed from the prototype features and common features; finally, a graph convolutional network aggregates the instance information.
  • Figure 4: Effect of the number of generated images per rare class on the HICO-Det dataset.
  • Figure 5: Qualitative evaluation on the HICO-Det dataset. The first row shows the ground truth, the second row shows the predictions of the previous benchmark model HOICLIP, and the third row shows the results of our proposed method. Blue bounding boxes represent persons, and green bounding boxes represent objects.
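
The three steps named in the Figure 3 caption (prototype selection, graph construction, graph-convolution aggregation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class-mean prototype rule, the cosine-similarity threshold of 0.5, the single randomly initialised layer, and all function names (`select_prototypes`, `build_graph`, `gcn_layer`) are hypothetical choices made here for clarity. The captions do not specify how prototypes are chosen or how edges are defined.

```python
import numpy as np

def select_prototypes(features, labels):
    """Step 1 (assumed rule): one prototype per class, the mean of its features."""
    classes = np.unique(labels)
    return np.stack([features[labels == c].mean(axis=0) for c in classes])

def build_graph(prototypes, features, threshold=0.5):
    """Step 2 (assumed rule): link nodes whose cosine similarity exceeds a threshold."""
    nodes = np.concatenate([prototypes, features], axis=0)
    unit = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
    adjacency = (unit @ unit.T > threshold).astype(np.float64)
    np.fill_diagonal(adjacency, 1.0)  # self-loops keep each node's own feature
    return nodes, adjacency

def gcn_layer(nodes, adjacency, weight):
    """Step 3: one graph-convolution step, ReLU(D^-1/2 A D^-1/2 X W)."""
    degree = adjacency.sum(axis=1)          # >= 1 because of the self-loops
    scale = 1.0 / np.sqrt(degree)
    normalized = adjacency * scale[:, None] * scale[None, :]
    return np.maximum(normalized @ nodes @ weight, 0.0)

# Toy data: six 4-dimensional instance features from two hypothetical HOI classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
labels = np.array([0, 0, 0, 1, 1, 1])

prototypes = select_prototypes(features, labels)
nodes, adjacency = build_graph(prototypes, features)
aggregated = gcn_layer(nodes, adjacency, rng.normal(size=(4, 4)))
```

Here the graph has one node per prototype plus one per instance, so `aggregated` holds a refined feature for every node. In a real model the weight matrix would be learned jointly with the HOI baseline rather than sampled at random.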