A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap
Lijun Zhang, Wei Suo, Peng Wang, Yanning Zhang
TL;DR
Rare human-object interaction (HOI) detection is hindered by long-tail data distributions and by a domain gap between AI-generated and real HOI data. The authors propose CEFA, a plug-and-play framework that bridges this gap with two components: an Instance Feature Alignment Module, which aligns instance-level HOI features, and a Context Enhancement Module, which enriches contextual cues through context-aware image reconstruction. Prototype Instance Alignment uses a graph-based prototype approach to aggregate instance information, while the context module reconstructs masked generated images under guidance from original-image features, supervised by a context loss. On HICO-Det and V-COCO, CEFA yields consistent gains on rare categories when combined with multiple HOI baselines, and ablations confirm that the two modules contribute complementary improvements, which supports using generated data to improve HOI detection in imbalanced settings.
Abstract
Human-object interaction (HOI) detection aims to detect human-object pairs in images and the corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, due to the natural bias of real-world data, existing methods mostly struggle with rare human-object pairs and produce sub-optimal results. With the recent development of generative models, a straightforward approach is to construct a more balanced dataset by adding a group of supplementary generated samples. Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset does not significantly boost performance. To alleviate this problem, we present a novel model-agnostic framework called \textbf{C}ontext-\textbf{E}nhanced \textbf{F}eature \textbf{A}lignment (CEFA), which effectively aligns the generated data with the original data at the feature level and bridges the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On the one hand, considering the crucial role of human-object pair information in HOI tasks, the feature alignment module aligns the human-object pairs by aggregating instance information. On the other hand, to mitigate the loss of important context information caused by traditional discriminator-style alignment methods, we employ a context-enhanced image reconstruction module to improve the model's ability to learn contextual cues. Extensive experiments show that our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories\footnote{https://github.com/LijunZhang01/CEFA}.
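To make the idea of feature-level alignment by aggregating instance information more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each human-object pair has already been encoded as a fixed-length feature vector with an interaction-class label, averages those vectors into one prototype per class for the real and the generated data, and measures the squared distance between matching prototypes. The function names `class_prototypes` and `prototype_alignment_loss` are hypothetical, and the plain per-class mean stands in for whatever aggregation CEFA actually uses.

```python
import numpy as np


def class_prototypes(features, labels, num_classes):
    """Average the instance features of each class into one prototype.

    Returns the prototype matrix and a boolean mask marking which classes
    had at least one instance (absent classes keep a zero prototype).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    protos = np.zeros((num_classes, features.shape[1]), dtype=np.float64)
    present = np.zeros(num_classes, dtype=bool)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
            present[c] = True
    return protos, present


def prototype_alignment_loss(real_feats, real_labels, gen_feats, gen_labels, num_classes):
    """Mean squared L2 distance between real and generated prototypes.

    Only classes that appear in both domains contribute; if no class is
    shared, the loss is defined as 0.0 (an assumption of this sketch).
    """
    real_p, real_ok = class_prototypes(real_feats, real_labels, num_classes)
    gen_p, gen_ok = class_prototypes(gen_feats, gen_labels, num_classes)
    shared = real_ok & gen_ok
    if not shared.any():
        return 0.0
    diff = real_p[shared] - gen_p[shared]
    return float((diff ** 2).sum(axis=1).mean())


# Toy usage: three real and two generated pair features over three classes.
real_feats = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]])
real_labels = np.array([0, 0, 1])
gen_feats = np.array([[1.0, 1.0], [4.0, 5.0]])
gen_labels = np.array([0, 1])
loss = prototype_alignment_loss(real_feats, real_labels, gen_feats, gen_labels, 3)
```

In a trainable setting, a term of this kind would be computed on detector features and minimized alongside the detection loss; the sketch only illustrates the prototype comparison itself.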
