From Dialogue to Execution: Mixture-of-Agents Assisted Interactive Planning for Behavior Tree-Based Long-Horizon Robot Execution

Kanata Suzuki, Kazuki Hori, Haruka Miyoshi, Shuhei Kurita, Tetsuya Ogata

TL;DR

Experiments on cocktail-making tasks show that the proposed MoA-assisted interactive planning improves dialogue efficiency while preserving execution quality in real-world robotic tasks.

Abstract

Interactive task planning with large language models (LLMs) enables robots to generate high-level action plans from natural language instructions. However, in long-horizon tasks, such approaches often require many questions, increasing user burden. Moreover, flat plan representations become difficult to manage as task complexity grows. We propose a framework that integrates Mixture-of-Agents (MoA)-based proxy answering into interactive planning and generates Behavior Trees (BTs) for structured long-term execution. The MoA consists of multiple LLM-based expert agents that answer general or domain-specific questions when possible, reducing unnecessary human intervention. The resulting BT hierarchically represents task logic and enables retry mechanisms and dynamic switching among multiple robot policies. Experiments on cocktail-making tasks show that the proposed method reduces human response requirements by approximately 27% while maintaining structural and semantic similarity to fully human-answered BTs. Real-robot experiments on a smoothie-making task further demonstrate successful long-horizon execution with adaptive policy switching and recovery from action failures. These results indicate that MoA-assisted interactive planning improves dialogue efficiency while preserving execution quality in real-world robotic tasks.
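The abstract describes BTs that represent task logic hierarchically and support retries and switching among robot policies. A minimal sketch of those mechanics follows. It is not the authors' implementation: the node classes, the action names, and the stand-in policy functions are all hypothetical, and the `RUNNING` status used by most BT libraries is omitted for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


# Shared execution trace: (action name, succeeded?) in tick order.
trace: List[Tuple[str, bool]] = []


class Node:
    def tick(self) -> Status:
        raise NotImplementedError


@dataclass
class Action(Node):
    name: str
    policy: Callable[[], bool]  # hypothetical policy: returns True on success

    def tick(self) -> Status:
        ok = self.policy()
        trace.append((self.name, ok))
        return Status.SUCCESS if ok else Status.FAILURE


@dataclass
class Sequence(Node):
    """Runs children in order; fails as soon as one child fails."""
    children: List[Node]

    def tick(self) -> Status:
        for child in self.children:
            if child.tick() is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS


@dataclass
class Fallback(Node):
    """Tries children in order; succeeds as soon as one child succeeds."""
    children: List[Node]

    def tick(self) -> Status:
        for child in self.children:
            if child.tick() is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


@dataclass
class Retry(Node):
    """Re-ticks its child up to max_attempts times."""
    child: Node
    max_attempts: int

    def tick(self) -> Status:
        for _ in range(self.max_attempts):
            if self.child.tick() is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


def fails_first(n: int) -> Callable[[], bool]:
    """Stand-in policy that fails on its first n calls, then succeeds."""
    calls = {"count": 0}

    def policy() -> bool:
        calls["count"] += 1
        return calls["count"] > n

    return policy


# Hypothetical task: retry a flaky action, then switch policies on failure.
tree = Sequence([
    Retry(Action("insert_fruit", fails_first(1)), max_attempts=3),
    Fallback([
        Action("close_lid_policy_a", lambda: False),  # first policy fails
        Action("close_lid_policy_b", lambda: True),   # alternative policy
    ]),
    Action("press_switch", lambda: True),
])

result = tree.tick()
for name, ok in trace:
    print(f"{name}: {'ok' if ok else 'failed'}")
print(result)
```

In this sketch the `Retry` decorator covers recovery from a failed action, and the `Fallback` node stands in for switching to another policy when the first one fails. How the paper assigns policies to action nodes is not specified here.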

Paper Structure (14 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Overview of the proposed MoA-assisted interactive planning framework. A natural language task instruction is refined through dialogue between an LLM planner and MoA-based proxy responders. The refined specification is converted into a Behavior Tree, which is executed on a real robot for long-horizon task completion.
  • Figure 2: Detailed pipeline of the proposed framework. Given a task instruction, the LLM performs uncertainty analysis on the generated BT and produces clarification questions. The MoA-based proxy-response mechanism resolves answerable questions using domain-specific, robot-specific, or commonsense expertise. Once ambiguity is resolved, the BT's action nodes are assigned appropriate learning models for execution.
  • Figure 3: Prompt design for the Mixture-of-Agents framework. Three expert agents (Robot Expert, Task Domain Expert, and Commonsense Expert) follow a structured response format consisting of answerability analysis, partial answering, and delegation of unresolved questions. This design enables collaborative proxy answering while preserving uncertainty for human resolution.
  • Figure 4: Experimental setup for the smoothie-making task. The dual-arm robot performs fruit insertion, lid manipulation, and switch operation.
  • Figure 5: Behavior Tree generated through MoA-assisted interactive planning. The figure shows the hierarchical BT structure for the smoothie-making task along with example question–answer interactions. Robot-related clarification questions are answered by the Robot Expert agent, while user-preference-related questions are resolved by the human. Different action nodes are assigned either Diffusion Policy or $\pi_{0.5}$ models.
  • ...and 1 more figure
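The proxy-answering flow described in the Figure 3 caption (answerability analysis, answering, and delegation of unresolved questions to the human) can be illustrated with a rough sketch. Everything below is an assumption made for illustration: the paper's experts are LLM-based, whereas `keyword_expert` is a trivial keyword matcher, and the first-answer-wins routing in `resolve`, the sample questions, and the sample answers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ExpertReply:
    answerable: bool              # outcome of the answerability analysis
    answer: Optional[str] = None  # filled only when answerable


Expert = Callable[[str], ExpertReply]


def keyword_expert(knowledge: Dict[str, str]) -> Expert:
    """Stand-in for an LLM-based expert: answers only known keywords."""
    def reply(question: str) -> ExpertReply:
        lowered = question.lower()
        for keyword, answer in knowledge.items():
            if keyword in lowered:
                return ExpertReply(True, answer)
        return ExpertReply(False)
    return reply


def resolve(
    questions: List[str],
    experts: Dict[str, Expert],
    ask_human: Callable[[str], str],
) -> Dict[str, Tuple[str, str]]:
    """Map each question to (responder, answer); unanswered ones go to the human."""
    resolved: Dict[str, Tuple[str, str]] = {}
    for question in questions:
        for name, expert in experts.items():
            reply = expert(question)
            if reply.answerable:
                resolved[question] = (name, reply.answer)
                break
        else:
            # No expert could answer: delegate to the human.
            resolved[question] = ("human", ask_human(question))
    return resolved


# Hypothetical experts and questions, for illustration only.
experts = {
    "robot_expert": keyword_expert({"gripper": "Use the right-arm gripper."}),
    "commonsense_expert": keyword_expert({"lid": "Close the lid before blending."}),
}
questions = [
    "Which gripper should hold the cup?",
    "Should the lid be closed before blending?",
    "Which fruit do you prefer?",
]

human_asked: List[str] = []


def ask_human(question: str) -> str:
    human_asked.append(question)
    return "banana"  # placeholder human answer


answers = resolve(questions, experts, ask_human)
for question, (responder, answer) in answers.items():
    print(f"[{responder}] {question} -> {answer}")
print(f"Questions sent to the human: {len(human_asked)} of {len(questions)}")
```

Only the preference question reaches the human in this toy run, which mirrors the division of labor described in the Figure 5 caption. The proportions here are arbitrary and say nothing about the paper's measured results.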