Geometric Features Enhanced Human-Object Interaction Detection

Manli Zhu, Edmond S. L. Ho, Shuang Chen, Longzhi Yang, Hubert P. H. Shum

TL;DR

GeoHOI tackles the robustness of human–object interaction detection in occluded and cluttered scenes by injecting fine-grained geometric priors into an end-to-end Transformer HOI framework. It introduces UniPointNet for self-supervised, cross-category keypoint learning, a Keypoint-aware Interactiveness Prediction (KIP) module to mine cross-instance cues via a graph convolution network, and a Part Attention Module (PAM) to focus on informative local human and object parts. Together, these components enrich interaction queries and improve interactiveness and interaction classification, achieving state-of-the-art results on V-COCO and competitive performance on HICO-DET, with a compelling post-disaster UAV case study. The work demonstrates the practical value of geometric priors in Transformer-based HOI detection and points to future directions in adaptive keypoint representations and large-language-model–assisted long-tail HOI learning.
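
To make the keypoint-driven interactiveness idea concrete, below is a minimal PyTorch sketch of how keypoint similarity could serve as graph connectivity for a GCN-based interactiveness score, which is what KIP does at a high level. The names (`keypoint_similarity`, `KeypointGCNLayer`, `InteractivenessHead`), the Gaussian-of-distance similarity, and all tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' code) of the idea behind the KIP module:
# use keypoint similarity between detected instances as graph connectivity,
# then propagate instance features with a graph convolution to score the
# interactiveness of candidate human-object pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


def keypoint_similarity(kpts: torch.Tensor) -> torch.Tensor:
    """kpts: (N, K, 2) normalised keypoints for N instances (humans + objects).
    Returns an (N, N) affinity matrix; a Gaussian of mean pairwise keypoint
    distance is used here as a stand-in for the paper's similarity measure."""
    diff = kpts.unsqueeze(0) - kpts.unsqueeze(1)           # (N, N, K, 2)
    dist = diff.norm(dim=-1).mean(dim=-1)                  # (N, N) mean keypoint distance
    return torch.exp(-dist)                                # closer layouts -> higher affinity


class KeypointGCNLayer(nn.Module):
    """One graph-convolution step with a residual connection (cf. Figure 3)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Row-normalise the keypoint-similarity adjacency, aggregate, add residual.
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return feats + F.relu(self.proj(adj @ feats))


class InteractivenessHead(nn.Module):
    """Scores each candidate human-object pair from holistic graph features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gcn = KeypointGCNLayer(dim)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, feats, kpts, pairs):
        g = self.gcn(feats, keypoint_similarity(kpts))      # (N, dim) graph features
        h, o = pairs[:, 0], pairs[:, 1]                     # indices of humans / objects
        return torch.sigmoid(self.score(torch.cat([g[h], g[o]], dim=-1)))
```

In the full model, pairs with the highest scores would be kept as interaction proposals and passed on to the query-enhancement stage; the actual GeoHOI pipeline also combines these graph features with pairwise spatial features.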

Abstract

Cameras are essential vision instruments for capturing images for pattern detection and measurement. Human-object interaction (HOI) detection is one of the most popular pattern detection approaches for captured human-centric visual scenes. Recently, Transformer-based models have become the dominant approach for HOI detection thanks to their advanced network architectures and thus promising results. However, most of them follow the one-stage design of the vanilla Transformer, leaving rich geometric priors under-exploited and leading to compromised performance, especially when occlusion occurs. Given that geometric features tend to outperform visual ones in occluded scenarios and offer information complementary to visual cues, we propose a novel end-to-end Transformer-style HOI detection model, the geometric features enhanced HOI detector (GeoHOI). A key part of the model is a new unified self-supervised keypoint learning method, UniPointNet, which bridges the gap toward a consistent keypoint representation across diverse object categories, including humans. GeoHOI upgrades a Transformer-based HOI detector by exploiting keypoint similarities that measure the likelihood of human-object interactions, as well as local keypoint patches that enhance the interaction query representation, so as to boost HOI predictions. Extensive experiments show that the proposed method outperforms state-of-the-art models on V-COCO and achieves competitive performance on HICO-DET. Case study results on post-disaster rescue with vision-based instruments showcase the applicability of the proposed GeoHOI in real-world applications.

Paper Structure

This paper contains 23 sections, 18 equations, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: Simple illustration of STIP. A solid bi-directional arrow indicates that two HOI triplets share the same human or object, while a dashed bi-directional arrow indicates that they share neither. (a) Given an input image, DETR is used to detect humans and objects. (b) By constructing all possible human-object pairs, the interaction proposal network uses pairwise features to filter out non-interactive ones. (c) Next, an interaction-centric graph is built to inject rich inter-interaction semantic structure and intra-interaction spatial structure. (d) Finally, a structure-aware Transformer is utilized to output a set of HOI predictions.
  • Figure 2: An overview of our GeoHOI framework. (a) Given an image, we adopt the off-the-shelf Panoptic DETR to detect the human and object instances within the image, generating their bounding boxes and segmentation masks. Based on the masks, we use our proposed UniPointNet to detect keypoints for all instances. (b) With the detected instances, the keypoint-aware interactiveness prediction module enumerates all possible human-object pairs, then retains the interactive ones with the highest interactiveness scores using coarse instance-level features, including pairwise and holistic graph features. (c) Taking all the interactive human-object pairs, we enhance their representations with human and object local patches, which are attended by self-attention; this encourages each interaction query to focus on informative human and object parts (a minimal sketch of this part-attention step follows the figure list). The final concatenated representations serve as interaction queries, which are fed into the structure-aware Transformer (STIP) to output a set of HOI predictions.
  • Figure 3: Illustration of the graph convolution layer, in which $\otimes$ represents the tensor product, and $\oplus$ is the residual connection. The output graph features encode relationships between all humans and objects from a global perspective, with keypoint similarity measuring their connectivity.
  • Figure 4: Overview of the self-supervised keypoint learning framework (UniPointNet). Given an object segmentation, we detect keypoints with learnable graph edge weights by reconstructing its binary mask (a simplified sketch of this reconstruction objective follows the figure list). The edge weights are represented by a color matrix and are shared across segmentation masks within clusters of similar shapes and structures. The masked segmentation binary map provides minimal appearance information, forcing the network to focus on learning keypoints that are important for representing the structure and shape of an object.
  • Figure 5: Qualitative results. The upper row showcases the effectiveness of the keypoints representation, while the lower row depicts failure cases.
  • ...and 6 more figures
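
The part-attention step in Figure 2(c) can be summarised with a short sketch: local features cropped around human and object keypoints are refined with self-attention and then concatenated onto the coarse pair query. This is a minimal illustration under assumed shapes and module names (`PartAttentionModule`, mean pooling over parts); the paper's exact design may differ.

```python
# Minimal sketch of the part-attention idea: self-attention over keypoint-local
# patch features, pooled and concatenated onto each interaction query.
import torch
import torch.nn as nn


class PartAttentionModule(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pair_query: torch.Tensor, part_feats: torch.Tensor) -> torch.Tensor:
        """pair_query: (P, D) coarse query per human-object pair.
        part_feats: (P, 2K, D) features of K human + K object keypoint patches."""
        attended, _ = self.attn(part_feats, part_feats, part_feats)  # attend over parts
        part_summary = attended.mean(dim=1)                          # (P, D) pooled part context
        return torch.cat([pair_query, part_summary], dim=-1)         # enriched interaction query
```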
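Figure 4's self-supervised objective can likewise be sketched: keypoints are predicted from an instance's binary mask and trained by reconstructing that mask from Gaussian keypoint maps, so no keypoint annotations are needed. The encoder and decoder below are deliberately tiny stand-ins and the learnable graph edge weights are omitted; every name and hyper-parameter here (`UniPointNetSketch`, map resolution, Gaussian bandwidth) is an assumption for illustration only.

```python
# Simplified sketch of a mask-reconstruction objective for self-supervised
# keypoint learning, in the spirit of UniPointNet as described in Figure 4.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniPointNetSketch(nn.Module):
    def __init__(self, num_keypoints: int = 16, size: int = 64):
        super().__init__()
        self.size = size
        self.encoder = nn.Sequential(                      # binary mask -> K keypoint coords
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * (size // 4) ** 2, num_keypoints * 2), nn.Tanh())
        self.decoder = nn.Sequential(                      # keypoint maps -> reconstructed mask
            nn.Conv2d(num_keypoints, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def heatmaps(self, kpts: torch.Tensor) -> torch.Tensor:
        # Render each keypoint as a Gaussian map over a [-1, 1] x [-1, 1] grid.
        ys = torch.linspace(-1, 1, self.size, device=kpts.device)
        xs = torch.linspace(-1, 1, self.size, device=kpts.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)               # (H, W, 2)
        d2 = ((grid[None, None] - kpts[:, :, None, None, :]) ** 2).sum(-1)
        return torch.exp(-d2 / 0.01)                       # (B, K, H, W)

    def forward(self, mask: torch.Tensor):
        """mask: (B, 1, H, W) binary instance mask in {0, 1}."""
        kpts = self.encoder(mask).view(mask.size(0), -1, 2)  # (B, K, 2) in [-1, 1]
        recon = self.decoder(self.heatmaps(kpts))             # (B, 1, H, W) logits
        loss = F.binary_cross_entropy_with_logits(recon, mask)
        return kpts, loss
```

Because the reconstruction target is only the binary mask, the network is pushed to place keypoints where they best summarise an instance's shape and structure, which is what lets a single detector serve humans and diverse object categories alike.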