Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models

Yu-Wei Zhan; Fan Liu; Xin Luo; Xin-Shun Xu; Liqiang Nie; Mohan Kankanhalli

TL;DR

ConCue integrates contextual cues generated by large vision-language models into feature extraction for HOI detection, and significantly improves the performance of existing methods on two widely used benchmark datasets.

Abstract

Human-Object Interaction (HOI) detection aims to detect human-object pairs and predict their interactions. However, conventional HOI detection methods often struggle to fully capture the contextual information needed to accurately identify these interactions. While large Vision-Language Models (VLMs) show promise in tasks involving human interactions, they are not tailored for HOI detection. The complexity of human behavior and the diverse contexts in which these interactions occur make the task even more challenging. Contextual cues, such as the participants involved, body language, and the surrounding environment, play crucial roles in predicting these interactions, especially those that are unseen or ambiguous. Because large VLMs are trained on vast image and text data, they can generate contextual cues that help in understanding real-world contexts, object relationships, and typical interactions. Building on this, we introduce ConCue, a novel approach for improving visual feature extraction in HOI detection. Specifically, we first design specialized prompts that use large VLMs to generate contextual cues for an image. To fully leverage these cues, we develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors. Extensive experiments and analyses demonstrate the effectiveness of using these contextual cues for HOI detection. The experimental results show that integrating ConCue with existing state-of-the-art methods significantly enhances their performance on two widely used datasets.

Paper Structure (34 sections, 12 equations, 5 figures, 7 tables)

Introduction
Related Work
Human-Object Interaction Detection
Pretrained Vision-Language Models
Preliminaries
ConCue
Overview
Contextual Cues Generation
Prompts with Large VLM
Contextual Cues
Contextual Cue-Enhanced Feature Extraction
HOI Encoders and Decoders
Contextual Cues and Image Encoders
Contextual Cue-Aware Instance Decoder
Contextual Cue-Aware Interaction Decoder
...and 19 more sections

Figures (5)

Figure 1: An example illustrating the performance of HOI detection methods. (1) Large VLMs (e.g., InstructBLIP) are not specifically designed for HOI detection, and (2) conventional HOI methods focus on visual information but overlook contextual information, leading to incorrect interaction classification. (3) Our proposed approach leverages contextual cues to achieve more accurate results.
Figure 2: The overall framework of ConCue. We first utilize a large VLM (e.g., InstructBLIP) to generate contextual cues, such as participant cues, body language cues, environmental cues, and temporal cues within images. These cues are then incorporated into the contextual cue-aware instance and interaction decoders to enhance feature extraction for HOI detection.
Figure 3: An illustration of generating contextual cues, including (a) # Participant Cues #, (b) # Body Language Cues #, (c) # Environmental Cues #, and (d) # Temporal Cues #.
Figure 4: Structure of the transformer-based contextual cue-enhanced feature extraction module. This module primarily consists of multiple cue-driven decoders and a feature fusion component.
Figure 5: Attention Visualization. The upper section shows the visualization of spatial feature attention in the interaction decoder, while the lower section displays the visualization of contextual visual cue attention in the decoder. In the upper section, the four images from left to right represent the prediction results, and the spatial attention maps for participant cues, body language cues, and environmental cues, respectively. In the lower section, words are highlighted in red if their attention exceeds the threshold.
