A Review of Human-Object Interaction Detection

Yuxiao Wang; Yu Lei; Li Cui; Weiying Xue; Qi Liu; Zhenao Wei

TL;DR

This survey addresses image-based HOI detection by formalizing the problem as predicting triplets $\langle$human, object, interaction$\rangle$ and reviewing the landscape of datasets and architectures. It contrasts two-stage pipelines, which separate human/object detection from interaction classification, with one-stage, end-to-end methods that predict HOI triplets directly, and highlights emerging techniques such as zero-shot learning, weakly supervised learning, and representations driven by large language models. The survey reports performance on standard benchmarks and discusses key challenges, including long-tail verb-object distributions, annotation costs, and the need for richer contextual information. The authors propose future directions, such as expanding dataset diversity, leveraging transfer learning, and developing more effective multi-task and context-aware HOI models to improve generalization and real-world applicability.

Abstract

Human-object interaction (HOI) detection plays a key role in high-level visual understanding, facilitating a deep comprehension of human activities. Specifically, HOI detection aims to locate the humans and objects involved in interactions within images or videos and to classify the specific interactions between them. The success of this task depends on several key factors, including the accurate localization of human and object instances and the correct classification of object categories and interaction relationships. This paper systematically summarizes and discusses recent work in image-based HOI detection. First, the mainstream datasets used in HOI detection are introduced. Next, starting with two-stage methods and then moving to end-to-end one-stage approaches, the paper discusses current developments in image-based HOI detection and analyzes the strengths and weaknesses of both families of methods. Advances in zero-shot learning, weakly supervised learning, and the application of large-scale language models to HOI detection are also discussed. Finally, the current challenges in HOI detection are outlined, and potential research directions and future trends are explored.
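To make the task definition concrete, the following is a minimal sketch of the triplet output and of a generic two-stage pipeline: detections are produced first, then every human-object pair is passed to an interaction classifier. It is not taken from the paper. The names `Detection`, `HOITriplet`, `two_stage_hoi`, and `toy_classifier`, the `"person"` label, the box convention, and the choice to multiply detection and interaction scores are all illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Detection:
    label: str    # object category, e.g. "person" or "bicycle"
    box: Box
    score: float  # detector confidence in [0, 1]


@dataclass(frozen=True)
class HOITriplet:
    human: Detection
    obj: Detection
    interaction: str
    score: float


def two_stage_hoi(
    detections: List[Detection],
    classify_pair: Callable[[Detection, Detection], Dict[str, float]],
    threshold: float = 0.5,
) -> List[HOITriplet]:
    """Stage 2 of a two-stage pipeline: pair each human with each other
    detection and keep the interactions whose combined score passes the
    threshold. Stage 1 (the object detector) is assumed to have produced
    `detections` already."""
    humans = [d for d in detections if d.label == "person"]
    triplets = []
    for human, obj in product(humans, detections):
        if human is obj:
            continue  # a detection is never paired with itself
        for verb, prob in classify_pair(human, obj).items():
            # Assumption: combine confidences by simple multiplication.
            score = human.score * obj.score * prob
            if score >= threshold:
                triplets.append(HOITriplet(human, obj, verb, score))
    return sorted(triplets, key=lambda t: t.score, reverse=True)


def toy_classifier(human: Detection, obj: Detection) -> Dict[str, float]:
    """Hypothetical stand-in for a learned interaction classifier."""
    return {"ride": 0.9} if obj.label == "bicycle" else {}
```

A one-stage method would replace both stages with a single model that emits `HOITriplet`-like outputs directly, avoiding the exhaustive pairing loop above.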

Paper Structure (8 sections, 1 figure, 3 tables)

This paper contains 8 sections, 1 figure, 3 tables.

Introduction
Datasets
The architectures of HOI Detection
Two-stage HOI detection architecture
One-stage HOI detection architecture
New techniques
Complex problem of HOI detection
Conclusion

Figures (1)

Figure 1: The flowchart of the HOI detection algorithm.
