Efficient Human-Object-Interaction (EHOI) Detection via Interaction Label Coding and Conditional Decision

Tsung-Shan Yang; Yun-Cheng Wang; Chengwei Wei; Suya You; C.-C. Jay Kuo

TL;DR

This work tackles Human-Object Interaction (HOI) detection under long-tailed, imbalanced data and the opacity of end-to-end models. It introduces EHOI, a two-stage detector. The first stage is a frozen object detector. The second stage uses four statistically grounded modules (visual features construction, hybrid interaction coding, discriminant features selection, and conditional decision) to encode interaction labels with error correction codes (ECCs) and to make conditional decisions via bit-wise XGBoost classifiers, all within a Green Learning framework. Compared with other detectors, EHOI has a much smaller model size and fewer inference FLOPs while maintaining competitive mAP, and its feedforward decision pathway is interpretable. The results suggest that ECC-based coding and modular, probability-based reasoning can deliver practical, transparent HOI detection suitable for edge and mobile settings, with potential for broader image understanding applications.

Abstract

Human-Object Interaction (HOI) detection is a fundamental task in image understanding. While deep-learning-based HOI methods achieve high mean Average Precision (mAP), they are computationally expensive and opaque in their training and inference processes. An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency. EHOI is a two-stage method. In the first stage, it leverages a frozen object detector to localize the objects and extract various features as intermediate outputs. In the second stage, an XGBoost classifier predicts the interaction type from the first-stage outputs. Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases, which reduces the model size and the complexity of the XGBoost classifier in the second stage. Additionally, we provide a mathematical formulation of the relabeling and decision-making process. Apart from the architecture, we present qualitative results to explain the functionalities of the feedforward modules. Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method.
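The idea of ECC-coded interaction labels with bit-wise classifiers can be illustrated with a minimal sketch. It is not the paper's hybrid coding scheme, which targets rare interaction cases. It is a generic error-correcting output code in which every label gets a binary codeword, one binary classifier would be trained per bit, and the predicted bits are decoded to the nearest codeword. The greedy codebook construction, the verb list, the code length, and the `bit_probs` values (standing in for the outputs of per-bit XGBoost classifiers) are all assumptions made for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def build_codebook(labels, n_bits, min_dist):
    """Greedily assign one codeword per label so that every pair of
    codewords differs in at least min_dist bits (hypothetical helper)."""
    chosen = []
    for word in product((0, 1), repeat=n_bits):
        if all(hamming(word, c) >= min_dist for c in chosen):
            chosen.append(word)
        if len(chosen) == len(labels):
            return dict(zip(labels, chosen))
    raise ValueError("not enough codewords: raise n_bits or lower min_dist")

def decode(bit_probs, codebook):
    """Return the label whose codeword is closest (L1 distance) to the
    per-bit probabilities, i.e. soft nearest-codeword decoding."""
    def cost(word):
        return sum(abs(p - b) for p, b in zip(bit_probs, word))
    return min(codebook, key=lambda label: cost(codebook[label]))

# Illustrative verb labels; the real label set is much larger.
labels = ["wash", "ride", "hold", "no_interaction"]
codebook = build_codebook(labels, n_bits=5, min_dist=3)

# Assumed outputs of five per-bit binary classifiers for one human-object
# pair. The first bit disagrees with the "ride" codeword, as if one
# bit-wise classifier had made a mistake.
bit_probs = [0.9, 0.1, 0.8, 0.7, 0.9]
predicted = decode(bit_probs, codebook)
print(predicted)
```

With a minimum Hamming distance of 3 between codewords, any single wrong bit still decodes to the intended label, which is the property that makes the coded labels tolerant of errors in individual bit classifiers.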

Paper Structure (24 sections, 8 equations, 9 figures, 5 tables)

This paper contains 24 sections, 8 equations, 9 figures, 5 tables.

Introduction
Related Work
One-stage HOI Detection
Two-stage HOI Detection
Green Learning
EHOI Method
System Overview
Processing Modules in the Second Stage
Module A: Visual Features Construction
Module B: Hybrid Interaction Coding
Module C: Discriminant Features Selection
Module D: Conditional Decision on the Interaction Type
Experiments
Datasets
Parameter Settings
...and 9 more sections

Figures (9)

Figure 1: Illustration of challenges in the HOI problem with images from the HICO-DET dataset: (a) images labeled as 'no_interaction,' (b)-(d) three images with the same verb, 'wash,' but humans behave differently.
Figure 2: Complexity comparison between the proposed EHOI and several other state-of-the-art (SOTA) detectors for the HICO-DET dataset, where the x-axis is the model size in the log scale, the y-axis is mAP (%), and the bubble size is proportional to the inference FLOP numbers.
Figure 3: The occurrence of <human-interaction-object> labels, where if the number of annotations exceeds 1,000, it is clipped to 1,000 for simplicity. Half of the relationship labels in the HICO-DET dataset have fewer than 200 annotations.
Figure 4: The overall system diagram of the proposed EHOI. Its first stage is a pre-trained object detector. The main contributions of EHOI lie in the data processing pipeline in the second stage. It consists of four modules: A) visual features construction, B) interaction label coding, C) discriminant features selection, and D) conditional decision on the interaction type.
Figure 5: Visualization of DFT, where pink and orange dots represent the "0" and "1" binary labels, and the loss function is the weighted cross-entropy sum of samples in the left and right parts of the partition line.
...and 4 more figures
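The partition loss described in the Figure 5 caption can be sketched as follows, under stated assumptions. The sketch reads the loss as the size-weighted entropy of the binary labels on the two sides of a partition line along one feature dimension, and takes the minimum over a fixed set of candidate thresholds as that feature's score, so that a lower score marks a more discriminant feature. The function names, the uniform threshold grid, and `n_bins` are hypothetical and may differ from the paper's exact formulation.

```python
from math import log2

def entropy(labels):
    """Binary entropy in bits of a list of 0/1 labels (0 if empty or pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def partition_loss(values, labels, threshold):
    """Size-weighted sum of the label entropies left and right of threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def feature_loss(values, labels, n_bins=8):
    """Smallest partition loss over uniformly spaced candidate thresholds
    (assumes the feature takes at least two distinct values)."""
    lo, hi = min(values), max(values)
    thresholds = [lo + (hi - lo) * k / n_bins for k in range(1, n_bins)]
    return min(partition_loss(values, labels, t) for t in thresholds)

# One feature dimension observed on six samples.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
separable = feature_loss(values, [0, 0, 0, 1, 1, 1])  # labels split cleanly
mixed = feature_loss(values, [0, 1, 0, 1, 0, 1])      # labels interleaved
print(separable, mixed)
```

Ranking feature dimensions by this score and keeping the lowest-loss ones is one way to realize a discriminant feature selection step.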
