Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang

TL;DR

Zero-shot HOI detection is achieved by fully decoupling object detection from interaction recognition and leveraging multi-modal large language models for open-vocabulary IR framed as deterministic visual question answering. Spatial-aware pooling and one-pass deterministic matching address detector noise and computational efficiency, enabling a detector-agnostic pipeline that can pair with any detector without retraining. Experiments on HICO-DET and V-COCO demonstrate strong zero-shot and cross-dataset generalization, as well as training-free IR capabilities. The approach offers a flexible, scalable paradigm for HOI detection that can benefit from advances in detectors and MLLMs.

Abstract

Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limits generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further improve performance and efficiency when the model is fine-tuned, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detector without retraining. The code is publicly available at https://github.com/SY-Xuan/DA-HOI.
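
To make the decoupling concrete, the following is a minimal sketch of a detector-agnostic pipeline, not the authors' implementation. The names `Detection`, `Recognizer`, and `detect_hoi` are hypothetical, and multiplying the detection scores by the interaction score is an assumption made for illustration. The point is only that the recognizer sees detected pairs and a verb list, so any detector producing labelled boxes can be swapped in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    label: str
    box: Box
    score: float


# Hypothetical interface: any callable that maps a human-object pair and a
# list of candidate verbs to one score per verb (for example, an MLLM wrapper).
Recognizer = Callable[[Detection, Detection, List[str]], Dict[str, float]]


def detect_hoi(
    detections: List[Detection],
    recognizer: Recognizer,
    verbs: List[str],
    human_label: str = "person",
) -> List[Tuple[Detection, str, Detection, float]]:
    """Pair every detected human with every other detection and score verbs."""
    humans = [d for d in detections if d.label == human_label]
    triplets = []
    for human in humans:
        for obj in detections:
            if obj is human:
                continue  # a human is never paired with itself
            for verb, verb_score in recognizer(human, obj, verbs).items():
                # Assumed scoring rule: product of detector and verb scores.
                score = human.score * obj.score * verb_score
                triplets.append((human, verb, obj, score))
    # Highest-scoring <human, verb, object> triplets first.
    return sorted(triplets, key=lambda t: t[3], reverse=True)
```

Because `recognizer` is just a callable, the detector and the recognition model can be replaced independently in this sketch, which mirrors the flexibility the abstract describes.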

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: (a) Existing methods, including two-stage methods, couple object detection and interaction recognition. Their performance is constrained by the limited generalization of detector features and coarse-grained VLM features. (b) Our method fully decouples these two processes and harnesses powerful MLLMs for interaction recognition. This design benefits from the generalization of MLLMs and advanced detectors.
  • Figure 2: The overall framework of our method and the spatial-aware pooling (SAP). (a) The proposed method decouples object detection and interaction recognition for HOI detection. Given a detected human-object pair, an MLLM is used to recognize their interaction. To enhance both performance and inference efficiency, SAP integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method enables the prediction of all candidate interactions in a single forward pass. (b) SAP takes the human and object features as input. The cross-attention layer aggregates features beyond the bounding-box area, improving robustness to noise in the detection results. The Spatial Embedding encodes useful pairwise information into the interaction features.
  • Figure 3: Visualization of successful examples. Humans (subjects) are marked with red rectangles, and objects with yellow rectangles.
  • Figure 4: Visualization of failure cases. Humans (subjects) are marked with red rectangles, and objects with yellow rectangles.
  • Figure 5: Visualization of the cross-attention map obtained from the spatial-aware pooling. The human is marked with a red rectangle, and the object with a yellow rectangle.
  • ...and 1 more figure
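
The Figure 2 caption mentions pairwise spatial cues for a human-object pair without giving their form here. As a rough illustration only, the sketch below computes one common handcrafted set of box-pair features (normalized center offset, log area ratio, and IoU). The function name `pairwise_spatial_features` and this particular choice of features are assumptions, and the paper's Spatial Embedding may be defined differently.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def pairwise_spatial_features(human: Box, obj: Box, eps: float = 1e-6) -> List[float]:
    """Return [dx, dy, log_area_ratio, iou] for a human-object box pair."""
    hx1, hy1, hx2, hy2 = human
    ox1, oy1, ox2, oy2 = obj
    hw, hh = hx2 - hx1, hy2 - hy1
    ow, oh = ox2 - ox1, oy2 - oy1

    # Offset of the object center from the human center, in human-box units.
    dx = ((ox1 + ox2) / 2 - (hx1 + hx2) / 2) / (hw + eps)
    dy = ((oy1 + oy2) / 2 - (hy1 + hy2) / 2) / (hh + eps)

    # Relative scale of the two boxes.
    log_area_ratio = math.log((ow * oh + eps) / (hw * hh + eps))

    # Intersection over union of the two boxes.
    iw = max(0.0, min(hx2, ox2) - max(hx1, ox1))
    ih = max(0.0, min(hy2, oy2) - max(hy1, oy1))
    inter = iw * ih
    union = hw * hh + ow * oh - inter
    iou = inter / (union + eps)

    return [dx, dy, log_area_ratio, iou]
```

A vector like this could be projected and combined with pooled appearance features, but how the paper fuses the two is not specified in this section.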