What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

Zhenlong Yuan, Xiangyan Qu, Jing Tang, Rui Chen, Lei Sun, Ruidong Chen, Hongwei Yu, Chengxuan Qian, Xiangxiang Chu, Shuo Li, Yuyin Zhou

TL;DR

The paper tackles open-vocabulary HOI detection by addressing three failure modes: perceptual-cognitive gaps, cross-modal hallucination, and occlusion. It introduces ImagineAgent, an agentic framework that combines cognitive mapping, tool augmentation, and generative imagination within a reinforcement learning loop to reason robustly about HOIs. A two-stage workflow, supported by a data-collection pipeline for structured reasoning chains and GRPO-based policy optimization, yields state-of-the-art results on SWIG-HOI and HICO-DET with roughly 20% of the training data required by prior methods. The work advances open-vocabulary visual reasoning by tightly integrating perception, external knowledge tools, and imagined viewpoints, and it could extend to broader tasks and larger tool libraries.

Abstract

Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20% of training data compared to existing methods, validating our robustness and efficiency.
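
The abstract describes a composite reward that balances prediction accuracy against tool efficiency but does not give its form here. The following is only a minimal sketch of one way such a reward could be assembled, not the authors' formula. The triplet encoding, the function names, the weights `w_format` and `tool_cost`, and the `free_calls` allowance are all hypothetical; the structural-coherence term follows the wording of the Figure 2 caption.

```python
from typing import Iterable, Tuple

# Hypothetical encoding of one interaction: (human id, action, object).
Triplet = Tuple[str, str, str]


def f1_score(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """Harmonic mean of precision and recall over sets of HOI triplets."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    if hits == 0:
        # Also covers empty predictions or empty ground truth.
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def composite_reward(
    predicted: Iterable[Triplet],
    gold: Iterable[Triplet],
    well_formed: bool,
    num_tool_calls: int,
    w_format: float = 0.1,   # assumed bonus for a structurally valid response
    tool_cost: float = 0.05,  # assumed penalty per tool call beyond the allowance
    free_calls: int = 1,      # assumed number of unpenalized tool calls
) -> float:
    """Accuracy term plus format bonus minus a penalty for extra tool calls."""
    accuracy = f1_score(predicted, gold)
    format_bonus = w_format if well_formed else 0.0
    extra_calls = max(0, num_tool_calls - free_calls)
    return accuracy + format_bonus - tool_cost * extra_calls
```

Under these assumed weights, an over-prediction (one correct triplet out of two predicted) combined with three tool calls loses the format bonus to the tool penalty, which illustrates the trade-off the abstract refers to.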

Paper Structure

This paper contains 27 sections, 10 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Three limitations in OV-HOI tasks. Given (a) an original image, MLLMs may suffer from (b) a perceptual-cognitive gap, where they perceive individual entities but fail to form a coherent understanding, leading to flawed cognition (e.g., linking Person 1 to Helmet 2 and Glove, which they are not wearing). Another critical issue is (c) cross-modal hallucination, where models generate plausible-sounding but visually unsupported interactions by over-relying on textual priors (e.g., mistaking a baseball bat for a tennis racket). Finally, (d) occlusion-induced ambiguity arises when key visual information is missing, causing the model to fail to detect entities and their connections.
  • Figure 2: Pipeline of ImagineAgent. Initially, we construct a high-quality dataset of structured reasoning chains by performing multiple rollouts with a powerful base model and selecting successful trajectories. Each data sample encapsulates a two-round reasoning process: (1) perception and tool selection, and (2) tool augmentation and cognition. Subsequently, this dataset is used for SFT to initialize the agent's policy, teaching it the fundamental structure of reasoning. Following this, the model's policy is further refined through RL using the GRPO algorithm, where the agent learns to make optimal decisions by performing rollouts and receiving feedback from a composite reward that balances precision and recall of predictions, structural coherence, and tool efficiency, thus enabling robust and effective inference.
  • Figure 3: Tool Library. To resolve visual and semantic ambiguities from the input image, our framework equips the agent with diverse tools. It leverages the BAGEL model for generative imagination via Outpaint and Viewpoint Transform, the Qwen API for Online RAG through Action Explanation and Scene Explanation, and a standard Image Crop tool to extract focused fine-grained interaction details.
  • Figure 4: Qualitative results of ImagineAgent's generative imagination. Image Outpaint extends the scene's context, providing a more holistic view of the environment that aids in understanding the overall activity. To overcome severe occlusion, the agent employs Image View Transformation to synthesize a novel perspective of the interaction. Image Crop supports a focused analysis of fine-grained interactions, allowing the model to scrutinize details that are critical for detection.
  • Figure 5: Case study comparing Qwen2.5-VL-7B and our ImagineAgent. Our method accurately identifies comprehensive HOIs.
  • ...and 5 more figures
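
The Figure 2 caption mentions that the policy is refined with GRPO by performing several rollouts and scoring each with the composite reward. As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes the rewards of one rollout group into advantages. It is a simplified sketch under stated assumptions, not the authors' training code: the function name and the `eps` constant are hypothetical, and the clipped policy-ratio objective and KL term of the full algorithm are omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one rollout group to zero mean and unit scale.

    Each rollout for the same input is scored against its siblings, so no
    separate value network is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (assumed value) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four rollouts for one image, two of which earned the full reward.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that score above the group mean receive positive advantages and are reinforced; when every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.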