ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Mingda Jia, Liming Zhao, Ge Li, Yun Zheng

TL;DR

ContextHOI addresses the robustness gap in HOI detection under occlusion by introducing a dual-branch framework that learns spatial context alongside instance features. It combines spatially contrastive and semantic-guided supervision, using CLIP-based guidance and a context aggregator that fuses context with instance information for HOI prediction. The approach achieves state-of-the-art results on HICO-DET and V-COCO and shows strong robustness on the newly proposed HICO-ambiguous benchmark, including in zero-shot settings. This work demonstrates that explicit spatial context modeling can significantly enhance action recognition when foreground cues are weak or blurred, offering practical improvements for real-world HOI understanding tasks.

Abstract

Spatial contexts, such as the backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.

Paper Structure

This paper contains 15 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The role of context learning in HOI detection. Spatial context, such as a $\textit{parking lot}$ or a $\textit{city road}$, helps little in identifying the salient car. However, context is critical in distinguishing human interactions: both $\textit{parking}$ and $\textit{driving}$ are highly related to the context information.
  • Figure 2: The overall architecture of ContextHOI. ContextHOI has a dual-branch and fusion structure, with instance detection and context learning branches. The instance detection branch captures instance-centric attributes, while the context learning branch focuses on instance-independent context features. We introduce a semantic-guided instance/context exploration module to distil prior knowledge from a VLM to help ground informative visual content. A set of spatially contrastive constraints supervises the learned instances and contexts to focus on different visual aspects. Finally, a context aggregator fuses the instance and context features for HOI prediction.
  • Figure 3: Inner design of semantic-guided context exploration module. $\textrm{pooling}$ refers to mean pooling on the spatial dimension of $\hat{\mathbf{Z}}$, $\textrm{concat}$ refers to concatenation.
  • Figure 4: Visualization analysis of spatial context learning. (a) Feature maps from the last layer of the instance decoder, context aggregator, and context extractor, indexed by the highest logits. Our instance decoder focuses on the appearance of the car, and the context extractor captures backgrounds and surrounding humans. (b) The features captured by context extractors with different component compositions. Both components help capture spatial contexts. Best viewed in color. Please zoom in for details.
  • Figure 5: Visualizations of the visual features captured by ContextHOI on images in HICO-DET (ambiguous). We mask the predicted instance boxes and let GPT-4V describe the left images; the words describing contexts in the GPT captions are selected and shown as light yellow text.
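
The two operations named in the Figure 3 caption, mean pooling over the spatial dimension of $\hat{\mathbf{Z}}$ and concatenation, can be illustrated with a minimal sketch. It assumes $\hat{\mathbf{Z}}$ is a feature map of shape (H, W, C) stored as nested lists, and that the pooled vector is concatenated with a hypothetical embedding called `query`. The function names, the shapes, and the choice of what is concatenated are illustrative assumptions and are not taken from the paper.

```python
def mean_pool_spatial(z_hat):
    """Average an (H, W, C) feature map over its H*W spatial positions.

    Returns a list of length C (one mean value per channel).
    """
    height, width = len(z_hat), len(z_hat[0])
    channels = len(z_hat[0][0])
    totals = [0.0] * channels
    for row in z_hat:
        for position in row:
            for k, value in enumerate(position):
                totals[k] += value
    return [total / (height * width) for total in totals]


def concat(a, b):
    """Concatenate two 1-D feature vectors along the channel axis."""
    return list(a) + list(b)


# Toy 2x2 feature map with 2 channels (values chosen only for illustration).
z_hat = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
# Hypothetical vector to fuse with the pooled context (an assumption).
query = [0.5, -0.5, 1.0]

pooled = mean_pool_spatial(z_hat)  # one value per channel
fused = concat(pooled, query)      # length = C + len(query)
```

Pooling collapses the spatial grid into a single per-channel summary, so the concatenated vector has a fixed length regardless of the input resolution.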