Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs

Zhaolin Cai, Huiyu Duan, Zitong Xu, Fan Li, Zhi Liu, Jing Liu, Wei Shen, Xiongkuo Min, Guangtao Zhai

TL;DR

This work reframes HOI detection from a closed-set, discriminative problem to an open-vocabulary generative task by guiding a frozen multi-modal LLM (MLLM) with a differentiable cognitive steering conduit (CSC). It introduces a hybrid interaction representation and a lightweight CSC that converts visual evidence into a structured visual kernel, enabling the MLLM to generate task-aligned interactions while preserving its world knowledge. A multi-task training objective combines generative supervision, semantic alignment, and commonsense constraints to ensure grounded yet flexible reasoning. Empirically, GRASP-HOI achieves state-of-the-art closed-set performance and strong zero-shot and open-vocabulary generalization on HICO-DET and V-COCO, demonstrating a unified paradigm that bridges discriminative perception and generative reasoning for open-world HOI detection.

Abstract

Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, and therefore struggle to generalize to the long tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HOI, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight learnable cognitive steering conduit (CSC) module to inject the fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
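The hybrid guidance strategy in the abstract pairs a language modeling loss with an auxiliary classification loss. The abstract does not state how the two terms are combined, so the sketch below assumes a simple weighted sum, with a hypothetical `aux_weight` and a single-label softmax classification head. It is an illustration under those assumptions, not the authors' implementation.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one categorical prediction, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def language_modeling_loss(token_logits, target_tokens):
    """Mean next-token cross-entropy over the target interaction phrase."""
    losses = [
        softmax_cross_entropy(logits, target)
        for logits, target in zip(token_logits, target_tokens)
    ]
    return sum(losses) / len(losses)


def hybrid_guidance_loss(token_logits, target_tokens,
                         verb_logits, verb_label, aux_weight=0.5):
    """Assumed form: L = L_lm + aux_weight * L_cls.

    aux_weight and the single-label classification head are hypothetical;
    the source only says the two losses are coupled.
    """
    l_lm = language_modeling_loss(token_logits, target_tokens)
    l_cls = softmax_cross_entropy(verb_logits, verb_label)
    return l_lm + aux_weight * l_cls, l_lm, l_cls


# Toy inputs: two generated tokens over a 3-word vocabulary,
# plus one auxiliary prediction over 4 verb classes.
token_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]]
target_tokens = [0, 1]
verb_logits = [0.2, 1.5, -0.3, 0.0]
verb_label = 1

total, l_lm, l_cls = hybrid_guidance_loss(
    token_logits, target_tokens, verb_logits, verb_label
)
print(f"L_lm={l_lm:.4f}  L_cls={l_cls:.4f}  total={total:.4f}")
```

In this reading, the generative term supervises the free-form interaction text, while the classification term ties each prediction back to the dataset's fixed labels.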

Paper Structure

This paper contains 37 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An overview of the traditional discriminative matching method and the proposed generative reasoning paradigm for HOI detection. (a) Traditional methods label each detected human-object pair via classification or matching, which limits predictions to frequent labels. (b) GRASP-HOI fuses multi-source features to steer a frozen MLLM, generating context-aware interactions beyond the closed set.
  • Figure 2: The architecture of GRASP-HOI, which performs open-vocabulary HOI detection by first processing multi-source representations and then steering a frozen generative model to describe them. The Instance Evidence Encoder and Appearance Evidence Encoder identify the humans and objects in the image and extract visual features from each detected human, object, and their bounding boxes. A salience adjudication transformer and an Orchestration Gate then distill the set of interaction features. The Cognitive Steering Conduit fuses each adjudicated candidate token with a global scene token from the frozen MLLM vision encoder into an evidence vector $e_k$. The visual kernel formulator transduces $e_k$ into a sequential visual kernel $Q_k$ that guides the frozen MLLM. This process enables GRASP-HOI to leverage a powerful, frozen MLLM for HOI detection with minimal, targeted training.
  • Figure 3: The architecture of the Cognitive Steering Conduit. The evidence fusion module produces a unified evidence vector $e_k$. The visual kernel formulator then transduces $e_k$ into the visual kernel $Q_k \in \mathbb{R}^{L \times d}$, which serves as a soft visual prefix to steer the frozen MLLM.
  • Figure 4: Qualitative visualization of the steering effect of the Cognitive Steering Conduit (CSC) on HICO-DET. (a) Attention from the frozen vision encoder is diffuse and often focuses on irrelevant regions. (b) Our visual kernel yields concentrated responses on the target human-object interaction regions.
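
The captions for Figures 2 and 3 describe an evidence vector $e_k$ being turned into a visual kernel $Q_k \in \mathbb{R}^{L \times d}$ that acts as a soft visual prefix for the frozen MLLM. The sketch below shows only the shapes involved. It assumes the formulator is a single linear projection followed by a reshape, which the captions do not confirm, and every name in it (`make_linear`, `formulate_visual_kernel`, `prepend_prefix`) and every dimension is hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned projection (weights and bias)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Plain matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def formulate_visual_kernel(e_k, weight, bias, prefix_len, d_model):
    """Map e_k to Q_k with shape (prefix_len, d_model), i.e. L x d."""
    flat = linear(e_k, weight, bias)
    return [flat[i * d_model:(i + 1) * d_model] for i in range(prefix_len)]


def prepend_prefix(q_k, prompt_embeddings):
    """Soft prefix: kernel rows first, then the prompt token embeddings."""
    return q_k + prompt_embeddings


rng = random.Random(0)
evidence_dim, prefix_len, d_model = 6, 4, 8  # toy sizes, chosen arbitrarily

weight, bias = make_linear(evidence_dim, prefix_len * d_model, rng)
e_k = [rng.uniform(-1.0, 1.0) for _ in range(evidence_dim)]

q_k = formulate_visual_kernel(e_k, weight, bias, prefix_len, d_model)
prompt = [[0.0] * d_model for _ in range(3)]  # 3 placeholder prompt tokens
mllm_input = prepend_prefix(q_k, prompt)

print(len(q_k), len(q_k[0]), len(mllm_input))  # 4 8 7
```

Because only the projection would carry trainable parameters in this sketch, the MLLM itself can stay frozen, in line with the "minimal, targeted training" noted in the Figure 2 caption.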