Towards Real-time Adaptation of Embodied Agent in Human-Robot Collaboration
Shipeng Liu, Boshen Zhang, Zhehui Huang
TL;DR
This work tackles real-time proactive adaptation in human–robot collaboration by presenting MonTA (Monitor-then-Adapt), a hierarchical framework that separates fast environmental monitoring (System 1) from deliberate adaptation reasoning (System 2). A fine-grained Overcooked-AI benchmark evaluates whether an agent can detect when adaptation is needed and produce high-quality subtask and path adaptations under sub-second constraints. MonTA pairs a fast embedding-based Monitor with GPT-4o-driven Subtask and Path Adapters. Experimental results show that MonTA outperforms baseline agents across diverse layouts with varying teaming fluency, and a human user study rates its adaptation plans as highly reasonable and its language instructions as consistent. The work demonstrates a practical pathway to real-time, proactive human–robot collaboration and highlights latency–accuracy trade-offs; future directions include personalization and action-level planning enhancements.
Abstract
Large Language Models (LLMs) have opened transformative possibilities for human-robot collaboration. However, real-time collaboration requires both low latency and robust reasoning, and most LLMs suffer from high latency. To address this gap, we first propose a fine-grained benchmark that explicitly assesses agents' proactive adaptability and temporal responsiveness in the Overcooked-AI environment. Based on the evaluation results, we propose MonTA (Monitor-then-Adapt), a hierarchical framework inspired by cognitive science research. MonTA contains three key modules: a lightweight Monitor that operates at high frequency (7 Hz) to detect adaptation needs, and two proficient Adapters for subtask and path adaptation reasoning that provide instructions to humans at a lower frequency. Our results demonstrate that MonTA significantly outperforms baseline agents on our proposed benchmark, achieving superior performance across layouts with varying teaming fluency. User studies confirm that the adaptation plans are highly reasonable and that the language instructions our framework provides to humans are consistent.
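The Monitor-then-Adapt control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation. The bag-of-words `embed` function stands in for a real embedding model, `stub_adapter` stands in for the GPT-4o-driven Adapters, and the prototype sentence, the threshold, and every function and class name are hypothetical. The sketch only shows the gating idea: a cheap check runs on every tick, and the expensive reasoning step runs only when that check fires.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumed stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Monitor:
    """Fast check: does the state resemble a known 'needs adaptation' case?"""

    def __init__(self, prototypes: List[str], threshold: float = 0.5):
        # Hypothetical prototypes; a real system would learn or curate these.
        self.prototypes = [embed(p) for p in prototypes]
        self.threshold = threshold

    def needs_adaptation(self, state: str) -> bool:
        e = embed(state)
        best = max((cosine(e, p) for p in self.prototypes), default=0.0)
        return best >= self.threshold


def stub_adapter(state: str) -> str:
    """Placeholder for the slow, LLM-backed subtask/path reasoning."""
    return f"adapt: {state}"


def run_episode(
    states: List[str],
    monitor: Monitor,
    adapter: Callable[[str], str],
) -> List[Tuple[int, str]]:
    """Run the monitor on every tick; call the adapter only when it fires."""
    instructions = []
    for tick, state in enumerate(states):  # one tick per monitor cycle
        if monitor.needs_adaptation(state):
            instructions.append((tick, adapter(state)))
    return instructions


if __name__ == "__main__":
    monitor = Monitor(["human blocked at counter"])
    states = [
        "human carrying onion to pot",
        "human blocked at counter",
        "human waiting near pot",
    ]
    print(run_episode(states, monitor, stub_adapter))
```

In this toy run only the second state is similar enough to the prototype to trigger the adapter, so the slow path is invoked once across three ticks. In a real deployment the loop would be paced by the monitor's tick rate (7 Hz in the abstract) and the adapter call would likely run asynchronously so that monitoring is not blocked.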
