ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Mingda Jia; Liming Zhao; Ge Li; Yun Zheng

TL;DR

ContextHOI addresses the robustness gap in HOI detection under occlusion with a dual-branch framework that learns spatial context alongside instance features. It trains the context branch with spatially contrastive and semantic-guided supervision, using a context aggregator and CLIP-based guidance, and fuses the resulting context with instance information for HOI prediction. The approach achieves state-of-the-art results on HICO-DET and V-COCO and shows strong robustness on the newly proposed HICO-ambiguous benchmark, including in zero-shot settings. The work shows that explicit spatial context modeling can improve action recognition when foreground cues are weak or blurred, which matters for real-world HOI understanding.

Abstract

Spatial contexts, such as backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the improvements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.
