FreeA: Human-object Interaction Detection using Free Annotation Labels
Qi Liu, Yuxiao Wang, Xinyu Jiang, Wolin Liang, Zhenao Wei, Yu Lei, Nan Zhuang, Weiying Xue
TL;DR
FreeA tackles HOI detection without manual annotations by using language-driven, self-adaptive labeling. It generates HOI labels for training through a three-stage pipeline: Candidate Image Construction (CIC), Human-Object Potential Interaction Mining (PIM) with CLIP-based text-image alignment, and Human-Object Interaction Inference (HII) with Interaction Correlation Matching and Prior Knowledge-based Masking. The approach achieves state-of-the-art performance among weakly supervised HOI methods on HICO-DET and V-COCO, with strong results even in some fully supervised settings, and is supported by extensive ablations and visualizations. By eliminating manual labeling, FreeA offers a scalable, domain-adaptive solution for HOI detection that reduces annotation costs and supports robust interaction understanding in images.
Abstract
Recent human-object interaction (HOI) detection methods depend on extensively annotated image datasets, which require substantial manual labor to produce. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA. This method leverages the adaptability of a text-image model to generate latent HOI labels without requiring manual annotation. Specifically, FreeA aligns image features of human-object pairs with HOI text templates and employs a knowledge-based masking technique to suppress improbable interactions. Furthermore, FreeA applies the proposed interaction correlation matching method to raise the likelihood of actions related to a given action, thereby refining the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI competitors. Compared with the newest "Weakly" supervised model, FreeA gains +13.29 mAP (159%↑) on HICO-DET and +17.30 mAP (98%↑) on V-COCO; compared with the latest "Weakly+" supervised model, it gains +7.19 mAP (28%↑) and +14.69 mAP (34%↑), respectively. FreeA is also more accurate in localizing and classifying interactive actions. The source code will be made public.
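The labeling idea described in the abstract (align pair features with text templates, mask improbable interactions, then raise the scores of correlated actions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional feature vectors, the temperature, the additive boosting rule, and the threshold are all assumptions chosen for illustration, and a real system would take its features from a text-image model such as CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=0.1):
    """Numerically stable softmax with a temperature (assumed value)."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mine_hoi_labels(pair_feature, text_features, allowed, correlation,
                    boost=0.5, threshold=0.25):
    """Return indices of HOI templates kept as pseudo-labels (hypothetical rule)."""
    # 1. Text-image alignment: score each HOI template against the pair feature.
    probs = softmax([cosine(pair_feature, t) for t in text_features])
    # 2. Prior-knowledge masking: zero out interactions implausible for the object.
    probs = [p if ok else 0.0 for p, ok in zip(probs, allowed)]
    # 3. Correlation matching: raise actions correlated with the top-scoring one.
    top = max(range(len(probs)), key=probs.__getitem__)
    boosted = [
        p + boost * correlation[top][j] * probs[top] if allowed[j] and j != top else p
        for j, p in enumerate(probs)
    ]
    return [j for j, p in enumerate(boosted) if p >= threshold]

# Toy templates: 0 = "ride bicycle", 1 = "hold bicycle", 2 = "eat bicycle".
pair_feature = [1.0, 0.2, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.9, 0.1, 0.4]]
allowed = [True, True, False]          # "eat bicycle" is ruled out by prior knowledge
correlation = [[1.0, 0.8, 0.0],        # riding and holding often co-occur
               [0.8, 1.0, 0.0],
               [0.0, 0.0, 1.0]]

kept = mine_hoi_labels(pair_feature, text_features, allowed, correlation)
print(kept)  # [0, 1]
```

In this toy case the masking step removes "eat bicycle" even though its raw similarity is high, and the correlation step lifts "hold bicycle" above the threshold because it is strongly correlated with the top-scoring "ride bicycle".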
