
FreeA: Human-object Interaction Detection using Free Annotation Labels

Qi Liu, Yuxiao Wang, Xinyu Jiang, Wolin Liang, Zhenao Wei, Yu Lei, Nan Zhuang, Weiying Xue

TL;DR

FreeA tackles HOI detection without manual annotations by using language-driven, self-adaptive labeling. It generates HOI labels for training through a three-stage pipeline: Candidate Image Construction (CIC), Human-Object Potential Interaction Mining (PIM) with CLIP-based text-image alignment, and Human-Object Interaction Inference (HII) with Interaction Correlation Matching and Prior Knowledge-based Masking. The approach reaches state-of-the-art performance among weakly supervised HOI methods on HICO-DET and V-COCO, with strong results even relative to some fully supervised methods, and the paper supports this with extensive ablations and visualizations. By eliminating manual labeling, FreeA reduces annotation cost and offers a scalable, domain-adaptive solution for HOI detection.

Abstract

Recent human-object interaction (HOI) detection methods depend on extensively annotated image datasets, which require substantial manual labor. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA. This method leverages the adaptability of a text-image model to generate latent HOI labels without requiring manual annotation. Specifically, FreeA aligns image features of human-object pairs with HOI text templates and employs a knowledge-based masking technique to suppress improbable interactions. Furthermore, FreeA uses an interaction correlation matching method to increase the probability of actions correlated with a particular action, thereby improving the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI competitors. On HICO-DET and V-COCO, respectively, FreeA gains +13.29 mAP (159%↑) and +17.30 mAP (98%↑) over the newest "Weakly" supervised model, and +7.19 mAP (28%↑) and +14.69 mAP (34%↑) over the latest "Weakly+" supervised model, and it localizes and classifies interactive actions more accurately. The source code will be made public.
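
The alignment and masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 3-dimensional embeddings, and the `plausible_actions` set (a stand-in for the knowledge-based mask) are all assumptions made for illustration. In practice the embeddings would come from a text-image model such as CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_actions(pair_embedding, template_embeddings, plausible_actions):
    """Score every action template against one human-object pair.

    pair_embedding: image feature of the human-object pair (hypothetical input).
    template_embeddings: dict mapping action -> text feature of its HOI template.
    plausible_actions: actions allowed for this object category; an assumed
        stand-in for the knowledge-based mask.
    """
    scores = {}
    for action, text_vec in template_embeddings.items():
        if action in plausible_actions:
            scores[action] = cosine(pair_embedding, text_vec)
        else:
            # Masked actions get -inf so they can never be the top choice.
            scores[action] = float("-inf")
    return scores

# Toy embeddings, invented for this example.
templates = {
    "ride": [1.0, 0.0, 0.0],
    "eat": [0.0, 1.0, 0.0],
    "hold": [0.6, 0.8, 0.0],
}
pair = [0.9, 0.1, 0.0]
scores = score_actions(pair, templates, plausible_actions={"ride", "hold"})
best = max(scores, key=scores.get)
print(best, scores)
```

Masking with negative infinity rather than zero is a design choice of this sketch: cosine similarity can be negative, so a zeroed action could otherwise outrank a plausible one.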


Paper Structure

This paper contains 11 sections, 12 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: HOI methods overview: (a) Fully supervised HOI models. Labels consist of human bounding boxes, object bounding boxes, object categories, and interaction actions of each human-object pair. (b) Weakly supervised HOI models. These fall into two categories: weakly+ (using $\langle$"interaction", "object"$\rangle$ labels) and weakly (using $\langle$"interaction"$\rangle$ labels). (c) Our method, i.e., FreeA, automatically generates HOI labels for HOI model training without the need for any manual annotation.
  • Figure 2: Method overview. Starting from an existing HOI model, we apply candidate image construction to extract humans and objects by detection and segmentation, and establish one-to-one human-object pairing. The human-object potential interaction mining module obtains initial HOI interaction labels from candidate image pairs, and uses a text-image matching model for domain adaptation. The human-object interaction inference module further refines these interaction labels by using a prior knowledge-based mask to eliminate implausible actions and the interaction correlation matching method to enhance the similarity of relevant actions. Finally, HOI labels are generated for model training through a dynamic threshold selector and an interaction action filter.
  • Figure 3: Details of the interaction correlation matching method.
  • Figure 4: Comparison of HOI labels. GL and GT represent generated labels and ground truth, respectively. The red and blue rectangles are bounding boxes for the human and object, and the green lines represent the connection between their centers. Green text indicates correct interactions.
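
The refinement stage named in the Figure 2 caption (correlation matching followed by a dynamic threshold) could look roughly like the sketch below. The captions give no formulas, so everything here is an assumption for illustration: the additive boosting rule, the `weight` and `ratio` parameters, the correlation values, and the choice of a threshold relative to the top score. It shows only the general idea of raising the scores of correlated actions before selecting labels.

```python
def boost_with_correlations(scores, correlation, weight=0.5):
    """Add a weighted share of correlated actions' scores to each action.

    correlation: dict mapping (action_a, action_b) -> value in [0, 1],
        assumed to be given; missing pairs count as uncorrelated.
    weight: hypothetical mixing factor, not taken from the paper.
    """
    boosted = {}
    for action, score in scores.items():
        extra = sum(
            correlation.get((action, other), 0.0) * other_score
            for other, other_score in scores.items()
            if other != action
        )
        boosted[action] = score + weight * extra
    return boosted

def select_labels(scores, ratio=0.8):
    """Keep actions scoring at least `ratio` times the best score.

    Assumes non-negative scores; the threshold adapts to each pair.
    """
    top = max(scores.values())
    return sorted(a for a, s in scores.items() if s >= ratio * top)

# Toy scores and correlations, invented for this example.
scores = {"ride": 0.60, "sit_on": 0.40, "eat": 0.10}
correlation = {("sit_on", "ride"): 0.9, ("ride", "sit_on"): 0.9}
boosted = boost_with_correlations(scores, correlation)
labels = select_labels(boosted)
print(labels)
```

With these toy numbers, "sit_on" falls below the threshold on the raw scores but is kept once its correlation with "ride" is taken into account.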