Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

Hang Zhang, Wenxiao Zhang, Haoxuan Qu, Jun Liu

TL;DR

We address the challenge of reliable video-based HOI detection in dynamic scenes by integrating external LLM reasoning with a strong V-HOI backbone. The proposed V-HOI MLCR framework employs two-stage Cross-Agents Reasoning and Multi-LLMs Debate to fuse diverse common-sense, spatial, and temporal knowledge, supplemented by CLIP-based auxiliary supervision. Empirical results on Action Genome and VidHOI show consistent accuracy gains over strong baselines, validating the effectiveness of external reasoning for HOI in videos. The approach offers a practical plug-and-play pathway to enhance human-centered scene understanding for robotics and autonomous systems.

Abstract

Human-centered dynamic scene understanding plays a pivotal role in enhancing the capability of robotic and autonomous systems. Within it, Video-based Human-Object Interaction (V-HOI) detection is a crucial semantic scene understanding task that aims to comprehensively understand HOI relationships within a video to benefit the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant strides in accurate detection on specific datasets, they still lack the human-like general reasoning ability needed to infer HOI relationships effectively. In this study, we propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that improve the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a Cross-Agents Reasoning scheme in which each LLM conducts reasoning from different aspects. In the second stage, we perform Multi-LLMs Debate to obtain the final reasoning answer based on the different knowledge held by different LLMs. Additionally, we devise an auxiliary training strategy that utilizes CLIP, a large vision-language model, to enhance the base V-HOI models' discriminative ability so that they cooperate better with LLMs. We validate our design by demonstrating its effectiveness in improving the prediction accuracy of the base V-HOI model via reasoning from multiple perspectives.

Paper Structure (21 sections, 3 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: The initial prediction from the SOTA V-HOI model [ni2023human] contains an incorrect relation. Our proposed MLCR refines the prediction to obtain the correct result.
  • Figure 2: Method Overview. Upon the analysis of video sequences, we initially apply the state-of-the-art models for V-HOI detection to obtain preliminary prediction triplets for individual frames. These triplets are then converted into textual form and processed through our LLMs collaborative framework. Our framework operates in two primary stages: the first stage is Cross-Agents Reasoning, where various distinct agents are established within each different LLM (ChatGPT, LLaMA2, and PaLM2) to evaluate the logic of the predictions from different perspectives, including spatial and temporal coherence. The second stage is the Multi-LLMs Debate, where we integrate responses from various LLMs in a debate-style format to refine and finalize the predictions.
  • Figure 3: We employ each LLM (e.g. ChatGPT) as a cross-aspect reasoning agent to improve the accuracy of predictions from current V-HOI detection models. Within this framework, we design three specialized reasoning agents, which apply (a) common sense, (b) spatial reasoning, and (c) temporal reasoning to refine the predictions produced by the existing V-HOI detection model.
  • Figure 4: The auxiliary training strategy using CLIP feature for regularization.
  • Figure 5: Some visual results of V-HOI MLCR. Our framework uses the scores from the LLMs to raise the scores of potentially correct triplets above the preset threshold, so that they are judged as correct predictions that match the label.
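
The flow described in the captions for Figures 2 and 5 (triplets converted to text, scored by aspect-specific agents inside each LLM, combined across LLMs, then compared against a threshold) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Triplet` class, the function names, the stub agent standing in for real LLM calls, the plain averaging used in place of an actual debate, and the weighted-sum fusion with `weight` and `threshold` defaults.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical type: an agent maps a textual triplet to a plausibility in [0, 1].
Agent = Callable[[str], float]


@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    obj: str
    score: float  # confidence from the base V-HOI detector


def triplet_to_text(t: Triplet) -> str:
    """Convert a predicted triplet into the textual form passed to the LLMs."""
    return f"{t.subject} {t.predicate} {t.obj}"


def cross_agents_score(text: str, agents: Dict[str, Agent]) -> float:
    """Stage 1 stand-in: average the aspect-specific scores within one LLM."""
    return mean([agent(text) for agent in agents.values()])


def debate_score(text: str, llms: Dict[str, Dict[str, Agent]]) -> float:
    """Stage 2 stand-in: a plain average across LLMs (assumed, not a real debate)."""
    return mean([cross_agents_score(text, agents) for agents in llms.values()])


def refine(triplets: List[Triplet], llms: Dict[str, Dict[str, Agent]],
           weight: float = 0.5, threshold: float = 0.5) -> List[Triplet]:
    """Fuse detector and LLM scores (assumed weighted sum) and keep triplets
    whose fused score reaches the threshold."""
    kept = []
    for t in triplets:
        llm_score = debate_score(triplet_to_text(t), llms)
        fused = (1 - weight) * t.score + weight * llm_score
        if fused >= threshold:
            kept.append(Triplet(t.subject, t.predicate, t.obj, fused))
    return kept


def stub_agent(text: str) -> float:
    """Deterministic placeholder for an LLM call; a real agent would prompt a model."""
    return 0.9 if text == "person sitting on sofa" else 0.1


aspects = ("common_sense", "spatial", "temporal")
llms = {name: {aspect: stub_agent for aspect in aspects}
        for name in ("ChatGPT", "LLaMA2", "PaLM2")}

frame = [
    Triplet("person", "sitting on", "sofa", 0.40),
    Triplet("person", "eating", "sofa", 0.45),
]
refined = refine(frame, llms)
```

With these stubs, the plausible triplet starts below the threshold and is lifted above it by the agents' scores, while the implausible one stays below and is dropped. How the actual framework prompts each agent, runs the debate, and combines scores is not specified in this section, so those parts should be read as placeholders.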