Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection

Ting Lei; Shaofeng Yin; Yang Liu

TL;DR

This work addresses open-vocabulary HOI detection by introducing CMD-SE, a framework that combines distance-aware, multi-level decoding with fine-grained body-part semantics derived from LLM prompts. By decoding HOIs from multiple feature levels and guiding learning with a soft distance constraint, it handles diverse human–object distances. It further enhances recognition by incorporating GPT-generated body-part state descriptions into CLIP-based embeddings, improving zero-shot performance on unseen interactions. Experiments on SWIG-HOI and HICO-DET demonstrate state-of-the-art results, highlighting the practical impact for open-world understanding of human-centric scenes.

Abstract

Open-vocabulary human-object interaction (HOI) detection, which aims to detect novel HOIs guided by natural language, is crucial for understanding human-centric scenes. However, prior zero-shot HOI detectors often use the same levels of feature maps to model HOIs at varying distances, leading to suboptimal performance in scenes containing human-object pairs with a wide range of distances. In addition, these detectors rely primarily on category names and overlook the rich contextual information that language can provide, which is essential for capturing open-vocabulary concepts that are typically rare and not well represented by category names alone. In this paper, we introduce a novel end-to-end open-vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE), harnessing the potential of Visual-Language Models (VLMs). Specifically, we propose to model human-object pairs at different distances with different levels of feature maps by incorporating a soft constraint during the bipartite matching process. Furthermore, we leverage large language models (LLMs) such as GPT models, exploiting their extensive world knowledge to generate descriptions of human body part states for various interactions. We then integrate the generalizable and fine-grained semantics of human body parts to improve interaction recognition. Experimental results on two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method achieves state-of-the-art results in open-vocabulary HOI detection. The code and models are available at https://github.com/ltttpku/CMD-SE-release.
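
The soft constraint in bipartite matching can be illustrated with a small sketch. It is not the paper's formulation: the per-level distance centers (`LEVEL_CENTERS`), the absolute-difference penalty, the `weight` parameter, and the brute-force search are all assumptions chosen for illustration. The idea shown is only that a query decoded from a given feature level pays an extra matching cost when the ground-truth human-object distance is far from the distance that level is meant to cover.

```python
from itertools import permutations

# Hypothetical preferred normalized human-object distance per feature level
# (low level -> small distance, high level -> large distance). Assumed values.
LEVEL_CENTERS = [0.1, 0.4, 0.8]


def distance_penalty(level, gt_distance, weight=1.0):
    """Soft constraint: grows as the ground-truth distance departs from the level's center."""
    return weight * abs(gt_distance - LEVEL_CENTERS[level])


def match(base_cost, query_levels, gt_distances, weight=1.0):
    """Return (assignment, total_cost); assignment[j] is the query matched to ground truth j.

    base_cost[i][j] stands in for the usual box/class matching cost.
    Brute force over permutations, adequate only for tiny examples.
    """
    n_queries, n_gt = len(base_cost), len(gt_distances)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n_queries), n_gt):
        cost = sum(
            base_cost[q][j] + distance_penalty(query_levels[q], gt_distances[j], weight)
            for j, q in enumerate(perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost


# Three queries, one per feature level; two ground-truth pairs (one close, one far).
base_cost = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
query_levels = [0, 1, 2]
gt_distances = [0.12, 0.75]
assignment, total = match(base_cost, query_levels, gt_distances)
print(assignment, round(total, 2))
```

With identical base costs, the penalty alone decides the assignment: the close pair goes to the low-level query and the distant pair to the high-level query. At realistic query counts the brute-force search would be replaced by a Hungarian solver such as `scipy.optimize.linear_sum_assignment`.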

Paper Structure

This paper contains 16 sections, 13 equations, 4 figures, and 6 tables.

Introduction
Related Work
Generic HOI Detection
Vision-Language Modeling in HOI Detection
Leverage LLM for Text Classifier
Method
Preliminary
Conditional Multi-level Decoding
Fine-grained Semantic Enhancement
Training and Inference
Experiment
Experimental Setting
Comparison with Other Methods
Ablation Study
Qualitative Results
...and 1 more sections

Figures (4)

Figure 1: (a) The previous method (THID) suffers a severe performance drop on HOIs with larger distances in the open-vocabulary setting. (b) Compared with HOI category names, body-part descriptions better capture the correlation of human postures across different actions. For instance, the actions of hurling and picking typically involve extended arms, whereas kicking is characterized by extended legs.
Figure 2: The framework of our CMD-SE. Given an image, the visual encoder is first applied to extract multi-level visual features. We then decode HOIs from the multi-level feature maps in parallel through a shared HOI decoder and, via conditional matching, encourage HOIs decoded from low-level feature maps to model HOIs with small distances and vice versa. Additionally, we query GPT to describe the states of human body parts for each HOI and use the generalizable and fine-grained descriptions as additional prompts to improve interaction recognition.
Figure 3: Illustration of generating prompts with GPT. The main purpose of the entire process is to find the simplest and most general body-part descriptions for each HOI.
Figure 4: Qualitative results of our method on the SWIG-HOI test set.
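
The point of Figure 1(b) can be made concrete with a toy sketch. Everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real text encoder such as CLIP, and the `descriptions` are hand-written rather than generated by GPT. The sketch only shows that actions with unrelated names can still be close in the space of body-part descriptions.

```python
import math


def embed(text, vocab):
    """Toy stand-in for a text encoder: L2-normalized bag-of-words vector."""
    words = text.lower().split()
    vec = [float(words.count(w)) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical body-part state descriptions, in the spirit of Figure 1(b).
descriptions = {
    "hurl": "arms extended",
    "pick": "arms extended",
    "kick": "legs extended",
}
vocab = sorted({w for d in descriptions.values() for w in d.split()} | set(descriptions))


def name_similarity(a, b):
    """Similarity between two actions using only their category names."""
    return cosine(embed(a, vocab), embed(b, vocab))


def description_similarity(a, b):
    """Similarity between two actions using their body-part descriptions."""
    return cosine(embed(descriptions[a], vocab), embed(descriptions[b], vocab))


for other in ("pick", "kick"):
    print("hurl vs", other, name_similarity("hurl", other),
          round(description_similarity("hurl", other), 2))
```

Under this toy encoder the category names of "hurl", "pick", and "kick" share nothing, while the descriptions place "hurl" closer to "pick" than to "kick". That is the kind of cross-action posture correlation the body-part prompts are intended to expose.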
