Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM

Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

TL;DR

The paper tackles long-horizon human-robot interaction by enabling robust robot action confirmation and micro-step planning from multimodal inputs. It introduces a long-context Q-former to leverage full-video context and a text-conditioning mechanism to feed linguistic information directly into the LLM decoder, with VideoLLaMA3 augmenting textual cues. Empirical results on YouCook2 show that long-context modeling yields consistent gains in action confirmation and planning, which are further amplified by text conditioning, achieving the best performance when combined. This approach advances multimodal scene understanding for HRI and has practical implications for cooking-domain robotics and beyond, where long-range task dependencies matter.

Abstract

Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former that incorporates left and right context dependencies in full videos. Furthermore, this paper proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the Q-former's high abstraction of textual information. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves confirmation and action planning by integrating VideoLLaMA3.

Paper Structure

This paper contains 10 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Robot action confirmation sentences in natural language are generated simultaneously with micro-step action sequences. Multimodal features, such as videos, images, audio, and speech, are extracted from the human demonstration videos, and multimodal LLMs are trained to generate robot action descriptions that confirm whether the action is correct before it is taken. Although a humanoid robot is illustrated in this example, the action steps are designed for commoditized single-arm robots.
  • Figure 2: Q-former-based confirmation sentence generation and action planning model [hori_2025_ICASSP]: AVBLIP-based action generation with a Q-former. The generated embeddings are fed to the LLM decoder.
  • Figure 3: Long-context Q-former-based confirmation sentence generation and action planning model: AVBLIP-based action generation with two Q-former modules. One generates token embeddings from the current video clip, and the other generates embeddings from the surrounding video clips. The generated embeddings are combined with a transformer encoder and fed to the LLM decoder.
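
The Figure 3 caption describes one Q-former reading the current clip and a second reading the surrounding clips. A minimal sketch of how the clips of a full video might be partitioned into those two inputs is shown below. It illustrates only the data split, not the model itself. The function name `split_context` and the `window` parameter are hypothetical: the source does not say whether the context is capped, so the cap here is an assumption.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_context(
    clips: Sequence[T], index: int, window: int
) -> Tuple[List[T], T, List[T]]:
    """Return (left context, current clip, right context) for one clip.

    `window` is a hypothetical cap on how many neighbouring clips are kept
    on each side; the source does not state such a limit.
    """
    if not 0 <= index < len(clips):
        raise IndexError("clip index out of range")
    if window < 0:
        raise ValueError("window must be non-negative")
    # Clips before the current one, at most `window` of them.
    left = list(clips[max(0, index - window):index])
    # Clips after the current one, at most `window` of them.
    right = list(clips[index + 1:index + 1 + window])
    return left, clips[index], right


# Toy video with five clips, identified by name only.
video = ["clip0", "clip1", "clip2", "clip3", "clip4"]
left, current, right = split_context(video, index=2, window=1)
print(left, current, right)
```

In this sketch the current clip would go to the clip-level Q-former and `left + right` to the context Q-former. Setting `window` to the number of clips or more keeps the whole video as context.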