TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making
Shanshan Li, Da Huang, Yu He, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue
TL;DR
The paper tackles long-horizon navigation tasks with multiple needs by introducing Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN) and a memory-rich embodied AI framework. It proposes the Autonomous Decision-Making World Model System (AWMSystem), consisting of BreakLLM, LocateLLM, and StatusMLLM, paired with the Multidimensional Accumulated Semantic Map (MASMap) and a Dual-Tempo Action Generator that balances deep reasoning with efficient control. Empirical results on AI2-THOR and ProcTHOR show that the approach achieves higher perception accuracy and navigation robustness than state-of-the-art baselines, while maintaining reasonable inference times. This work advances embodied AI by supporting explicit task preferences and multi-subtask planning without requiring end-to-end retraining, with practical implications for real-world service robots and virtual agents.
Abstract
In daily life, people often move through spaces to find objects that meet their needs, posing a key challenge in embodied AI. Traditional Demand-Driven Navigation (DDN) handles one need at a time but does not reflect the complexity of real-world tasks involving multiple needs and personal choices. To bridge this gap, we introduce Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN), a new benchmark for long-horizon navigation involving multiple sub-demands with explicit task preferences. To solve TP-MDDN, we propose AWMSystem, an autonomous decision-making system composed of three key modules: BreakLLM (instruction decomposition), LocateLLM (goal selection), and StatusMLLM (task monitoring). For spatial memory, we design MASMap, which combines 3D point cloud accumulation with 2D semantic mapping for accurate and efficient environmental understanding. Our Dual-Tempo action generation framework integrates zero-shot planning with policy-based fine control, and is further supported by an Adaptive Error Corrector that handles failure cases in real time. Experiments demonstrate that our approach outperforms state-of-the-art baselines in both perception accuracy and navigation robustness.
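The division of labor among the three AWMSystem modules can be pictured as a simple control loop: decompose the instruction, pick a goal for each sub-demand, then check whether the sub-demand was met. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The names run_episode, SubDemand, CANDIDATES, toy_break, toy_locate, and toy_status are hypothetical; the toy functions are rule-based stand-ins for BreakLLM, LocateLLM, and StatusMLLM; and a plain dictionary of object labels stands in for MASMap. Navigation itself and the Adaptive Error Corrector are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Position = Tuple[float, float]
# Hypothetical stand-in for MASMap: object label -> 2D map position.
SemanticMap = Dict[str, Position]


@dataclass
class SubDemand:
    text: str                           # the need, e.g. "drink"
    preferred: Optional[str] = None     # explicit task preference, if any
    satisfied_by: Optional[str] = None  # object that met the need


def run_episode(
    instruction: str,
    break_fn: Callable[[str], List[SubDemand]],
    locate_fn: Callable[[SubDemand, SemanticMap, List[str]], Optional[str]],
    status_fn: Callable[[SubDemand, str], bool],
    semantic_map: SemanticMap,
    max_attempts: int = 3,
) -> List[SubDemand]:
    """Decompose, select a goal, and monitor completion per sub-demand."""
    sub_demands = break_fn(instruction)          # role of BreakLLM
    for demand in sub_demands:
        tried: List[str] = []
        for _ in range(max_attempts):
            goal = locate_fn(demand, semantic_map, tried)  # role of LocateLLM
            if goal is None:
                break                            # nothing suitable on the map
            tried.append(goal)
            # A real agent would navigate to semantic_map[goal] here.
            if status_fn(demand, goal):          # role of StatusMLLM
                demand.satisfied_by = goal
                break
    return sub_demands


# Toy stand-ins (assumptions for illustration only).
CANDIDATES = {"drink": ["juice", "water"], "rest": ["sofa", "bed"]}


def toy_break(instruction: str) -> List[SubDemand]:
    # Expects "need:preference; need" style input, e.g. "drink:water; rest".
    demands = []
    for part in instruction.split(";"):
        name, _, pref = part.strip().partition(":")
        demands.append(SubDemand(text=name.strip(), preferred=pref.strip() or None))
    return demands


def toy_locate(demand: SubDemand, semantic_map: SemanticMap,
               tried: List[str]) -> Optional[str]:
    options = list(CANDIDATES.get(demand.text, []))
    if demand.preferred in options:
        # Move the preferred object to the front of the candidate list.
        options.remove(demand.preferred)
        options.insert(0, demand.preferred)
    for label in options:
        if label in semantic_map and label not in tried:
            return label
    return None


def toy_status(demand: SubDemand, goal: str) -> bool:
    return True  # assume every reached goal satisfies the demand
```

For example, with a map containing water, juice, and a bed, calling run_episode("drink:water; rest", toy_break, toy_locate, toy_status, semantic_map) selects water for the first sub-demand because of the stated preference, and falls back to the bed for the second because no sofa is on the map.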
