HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

Junwen Chen, Peilin Xiong, Keiji Yanai

TL;DR

This paper tackles Human-Object Interaction Detection (HOID) by removing the object detector and instead using a multimodal large language model (MLLM) alone to reason about HOIs in natural language. It introduces HOI-R1, a two-stage framework that combines supervised fine-tuning with thinking distillation and reinforcement learning using HOID-specific rewards, guided by carefully designed prompts. On the HICO-DET dataset, HOI-R1 achieves roughly 2x the mAP of the baseline, demonstrating that MLLMs can perform structured HOID with minimal architectural changes. The work highlights the potential of end-to-end language-based HOID, supported by ablations showing the importance of reasoning traces and IoU-based rewards for localization accuracy.

Abstract

Recent human-object interaction detection (HOID) methods rely heavily on prior knowledge from vision-language models (VLMs) to enhance their interaction recognition capabilities. Designing the training strategies and model architectures that connect VLM knowledge to the HOI instance representations from the object detector is challenging, and the resulting framework is complex, which hinders further development or application. Meanwhile, the inherent reasoning abilities of multimodal large language models (MLLMs) on human-object interaction detection remain under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and, for the first time, explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task purely through text. The results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with strong generalization ability. The source code is available at https://github.com/cjw2021/HOI-R1.

Paper Structure

This paper contains 13 sections, 20 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of the pipeline of traditional HOID methods and our proposed HOI-R1. Traditional HOID methods rely on object detectors to extract HOI embeddings, while HOI-R1 directly interprets interactions through natural language reasoning using MLLMs.
  • Figure 2: Training convergence of HOI-R1 with Qwen2.5-VL-3B-Instruct on HICO-DET. The mAP on the Full category under the Default setting is shown. HOI-R1 achieves a more than 2x performance boost with only 1 epoch of SFT and 40 steps of RL training.
  • Figure 3: Overview of our HOI-R1 framework. The input consists of two modalities: image and text. The question text consists of three parts: the task instruction gives basic information about the task, the reasoning guidance provides hints for the reasoning process, and the format example regularizes the output. First, a teacher MLLM is used to generate reasoning steps for Supervised Fine-tuning (SFT). Then, in the Reinforcement Learning (RL) stage, the student MLLM, acting as the policy model, is trained with four reward signals.
  • Figure 4: The input question template for HOI-R1. The template consists of three key components: Task Instruction, Reasoning Guidance, and Format Example.
  • Figure 5: The reward functions of HOI-R1. We design a key format reward, a label reward, and an IoU reward to ensure the structural, semantic, and geometric alignment of the model outputs with the ground truth.
  • ...and 1 more figure
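
The captions above describe rewards that check structural, semantic, and geometric agreement between the text output and the ground truth. The following is a minimal sketch of what such checks could look like, not the paper's implementation: the JSON output schema (`human_box`, `object_box`, `object`, `verb`), the function names, and the choice to score a pair by the smaller of its two box IoUs are all assumptions made for illustration. Boxes are assumed to be `[x1, y1, x2, y2]`.

```python
import json

# Hypothetical output schema, assumed for illustration only.
REQUIRED_KEYS = {"human_box", "object_box", "object", "verb"}


def box_iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def key_format_reward(text):
    """Structural check: 1.0 if the text parses as a non-empty JSON list
    of objects whose keys are exactly REQUIRED_KEYS, else 0.0."""
    try:
        items = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(items, list) or not items:
        return 0.0
    ok = all(isinstance(it, dict) and set(it) == REQUIRED_KEYS for it in items)
    return 1.0 if ok else 0.0


def label_reward(pred, gt):
    """Semantic check: 1.0 if both the object and verb labels match."""
    same = (pred["object"], pred["verb"]) == (gt["object"], gt["verb"])
    return 1.0 if same else 0.0


def iou_reward(pred, gt):
    """Geometric check: the smaller of the human-box and object-box IoUs,
    so a pair scores well only when both boxes are well localized."""
    return min(
        box_iou(pred["human_box"], gt["human_box"]),
        box_iou(pred["object_box"], gt["object_box"]),
    )


gt = {"human_box": [0, 0, 2, 2], "object_box": [1, 1, 3, 3],
      "object": "bicycle", "verb": "ride"}
pred = {"human_box": [0, 0, 2, 2], "object_box": [0, 0, 2, 2],
        "object": "bicycle", "verb": "ride"}
```

Taking the minimum of the two IoUs mirrors the usual HOI matching convention, where a predicted pair counts only if both its human box and its object box overlap the ground truth. A real reward would also need to match several predicted pairs against several ground-truth pairs, which this sketch leaves out.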