Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
TL;DR
The paper tackles long-horizon human-robot interaction by enabling robust robot action confirmation and micro-step planning from multimodal inputs. It introduces a long-context Q-former to leverage full-video context and a text-conditioning mechanism to feed linguistic information directly into the LLM decoder, with VideoLLaMA3 augmenting textual cues. Empirical results on YouCook2 show that long-context modeling yields consistent gains in action confirmation and planning, which are further amplified by text conditioning, achieving the best performance when combined. This approach advances multimodal scene understanding for HRI and has practical implications for cooking-domain robotics and beyond, where long-range task dependencies matter.
Abstract
Human-robot collaboration towards a shared goal requires robots to understand human actions and interactions with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former that incorporates left and right context dependencies across full videos. Furthermore, this paper proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the Q-former's high abstraction of textual information. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves confirmation and action planning when integrated with VideoLLaMA3.
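The two ideas in the abstract, left and right context around a target clip and text embeddings passed straight to the decoder, can be pictured with a minimal data-flow sketch. This is not the authors' implementation: the function names `context_window` and `build_decoder_prefix`, the `left`/`right` window sizes, and the use of plain lists as feature vectors are all assumptions made for illustration. The abstract does not specify how context is selected or how the embeddings are combined.

```python
from typing import List, Sequence

Vector = List[float]


def context_window(
    clips: Sequence[Sequence[Vector]], target: int, left: int, right: int
) -> List[Vector]:
    """Gather frame features for the target clip plus its neighbours.

    Hypothetical helper: takes up to `left` preceding and `right` following
    clips from the same video and flattens them into one feature sequence.
    """
    if not 0 <= target < len(clips):
        raise IndexError("target clip index out of range")
    start = max(0, target - left)               # clamp at the start of the video
    stop = min(len(clips), target + right + 1)  # clamp at the end of the video
    window: List[Vector] = []
    for clip in clips[start:stop]:
        window.extend(clip)
    return window


def build_decoder_prefix(
    query_outputs: Sequence[Vector], text_embeddings: Sequence[Vector]
) -> List[Vector]:
    """Assumed form of text conditioning: append the text embeddings to the
    Q-former outputs so the decoder sees them without further abstraction."""
    return list(query_outputs) + list(text_embeddings)


# Toy video with four clips; each inner list is one frame feature.
video = [[[0.0]], [[1.0], [1.5]], [[2.0]], [[3.0]]]
long_context = context_window(video, target=1, left=1, right=1)
prefix = build_decoder_prefix([[9.0]], [[7.0]])
```

Setting `left` and `right` large enough to cover every clip corresponds to the full-video case, while setting both to zero reduces to clip-level processing.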
