Towards Real-time Adaptation of Embodied Agent in Human-Robot Collaboration

Shipeng Liu, Boshen Zhang, Zhehui Huang

TL;DR

This work tackles real-time proactive adaptation in human-robot collaboration with MonTA, a hierarchical framework that separates fast environmental monitoring (System 1) from deliberate adaptation reasoning (System 2). A fine-grained Overcooked-AI benchmark evaluates whether agents can detect when adaptation is needed and produce high-quality subtask and path adaptations under sub-second constraints. MonTA pairs a fast embedding-based Monitor with GPT-4o-driven Subtask and Path Adapters. Experimental results show that MonTA outperforms baseline agents across layouts with varying teaming fluency, and a user study confirms that its adaptation plans are reasonable and its language instructions consistent. The work demonstrates a practical pathway to real-time, proactive human-robot collaboration and highlights latency-accuracy trade-offs; future directions include personalization and action-level planning enhancements.

Abstract

Large Language Models (LLMs) have opened transformative possibilities for human-robot collaboration. However, enabling real-time collaboration requires both low latency and robust reasoning, and most LLMs suffer from high latency. To address this gap, we first propose a fine-grained benchmark that explicitly assesses agents' proactive adaptability and temporal responsiveness in the Overcooked-AI environment. Based on evaluation results, we propose MonTA (Monitor-then-Adapt), a hierarchical framework inspired by cognitive science research. MonTA contains three key modules: a lightweight Monitor that operates at high frequency (7 Hz) to detect adaptation needs, and two proficient Adapters for subtask and path adaptation reasoning that provide instructions to humans at a lower frequency. Our results demonstrate that MonTA significantly outperforms baseline agents on our proposed benchmark, achieving superior performance across layouts with varying teaming fluency. User studies confirm the high reasonableness of adaptation plans and consistent language instructions provided by our framework to humans.
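The monitor-then-adapt split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `MonitorThenAdapt`, the state keys `path_blocked` and `subtask_conflict`, and the toy monitor and adapter functions are all hypothetical stand-ins. In the paper the Monitor is embedding-based and the Adapters are LLM-driven, whereas here they are plain Python functions. The sketch shows only the control flow: a cheap check runs on every tick, and a costly adapter runs only when that check requests it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MonitorThenAdapt:
    """Hypothetical two-tier loop: cheap monitor every tick, adapters on demand."""

    # Fast check: returns "subtask", "path", or None when no adaptation is needed.
    monitor: Callable[[dict], Optional[str]]
    # Slow reasoning modules, keyed by adaptation kind; each returns an instruction.
    adapters: Dict[str, Callable[[dict], str]]
    instructions: List[str] = field(default_factory=list)
    adapter_calls: int = 0

    def step(self, state: dict) -> Optional[str]:
        kind = self.monitor(state)  # fast path, evaluated on every tick
        if kind is None:
            return None  # nothing to adapt, so the slow path is skipped
        self.adapter_calls += 1
        instruction = self.adapters[kind](state)  # slow path, invoked on request only
        self.instructions.append(instruction)
        return instruction


def toy_monitor(state: dict) -> Optional[str]:
    # Assumed state keys, for illustration only.
    if state.get("path_blocked"):
        return "path"
    if state.get("subtask_conflict"):
        return "subtask"
    return None


def toy_path_adapter(state: dict) -> str:
    return "Please wait one step so I can pass."


def toy_subtask_adapter(state: dict) -> str:
    return "I will plate the soup; please fetch an onion."


def run_demo() -> MonitorThenAdapt:
    agent = MonitorThenAdapt(
        monitor=toy_monitor,
        adapters={"path": toy_path_adapter, "subtask": toy_subtask_adapter},
    )
    frames = [{}, {}, {"path_blocked": True}, {}, {"subtask_conflict": True}, {}]
    for frame in frames:
        agent.step(frame)
    return agent
```

In a real-time setting the caller would invoke `step` at the monitor's rate and run the adapters asynchronously so a slow adapter call does not stall monitoring. The sketch omits both to stay deterministic.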

Paper Structure

This paper contains 28 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The Overcooked-AI simulator. (A) The cooking procedure to finish one order. (B) The game interface used to test agents and conduct user studies.
  • Figure 2: Benchmark for evaluating LLM-based agents' real-time adaptation capabilities. (A) Six representative layouts with teaming fluency ranging from 85.3% to 16.7%. Each red cross marks a critical point that would interfere with another agent's workflow. (B) Three representative path adaptation testing frames designed by human experts: self-adapt, other-adapt, and both-ok types, viewed from the perspective of the blue agent. The subtask goal locations for the blue and green agents are marked as blue "G" and green "G", respectively, with their greedy paths shown as arrowed lines. The blue agent is giving language instructions. (C) Three representative subtask adaptation testing frames where the blue agent is giving language instructions.
  • Figure 3: LLM capability and latency evaluation. Success rates are reported for determining whether adaptation is needed ($SR_m$), generating subtask adaptation plans ($SR_{sa}$), and generating path adaptation plans ($SR_{pa}$), along with their corresponding average execution frequencies $f_m$, $f_{sa}$, and $f_{pa}$. Frequencies are normalized to the minimum required for real-time collaboration: 10 Hz for monitoring, and 0.5 Hz for both subtask and path adaptation. Frequencies exceeding these thresholds are cropped. The two circles denote results from the embedding-based classifier, for which adaptation test results are not available.
  • Figure 4: MonTA Framework. The framework comprises a real-time monitor and two primary adapter modules: the subtask adapter and the path adapter. The monitor operates at a high frequency to continuously assess the collaboration status and determine whether adaptation is necessary. The adapters are invoked only upon the monitor's request, and they decide how language instructions should be sent to the communication adapter to guide the human collaborator.
  • Figure 5: Overall evaluation results. Average scores are compared across three agent pairs: MonTA (ours) vs. greedy, SAA vs. greedy, and greedy vs. greedy.
  • ...and 2 more figures
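
The frequency normalization mentioned in the Figure 3 caption can be sketched as follows. This is one reading of the caption, not the authors' plotting code: it assumes "normalized" means dividing the measured frequency by the minimum required frequency, and "cropped" means clipping the ratio at 1.0. The function name `normalized_frequency` and the dictionary keys are hypothetical, and the measured values in the usage note are arbitrary inputs, not results from the paper.

```python
# Minimum frequencies for real-time collaboration, as stated in the Figure 3 caption.
REQUIRED_HZ = {"monitor": 10.0, "subtask_adapter": 0.5, "path_adapter": 0.5}


def normalized_frequency(measured_hz: float, required_hz: float) -> float:
    """Ratio of measured to required frequency, clipped at 1.0 (assumed reading)."""
    if required_hz <= 0:
        raise ValueError("required_hz must be positive")
    if measured_hz < 0:
        raise ValueError("measured_hz must be non-negative")
    return min(measured_hz / required_hz, 1.0)
```

For example, `normalized_frequency(4.0, REQUIRED_HZ["monitor"])` evaluates to 0.4, while any measured value at or above the requirement, such as 25.0 Hz for the monitor, is reported as 1.0.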