TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making

Shanshan Li, Da Huang, Yu He, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

TL;DR

The paper tackles long-horizon navigation with multiple needs by introducing Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN) and a memory-rich embodied AI framework. It proposes the Autonomous Decision-Making World Model System (AWMSystem), which consists of BreakLLM, LocateLLM, and StatusMLLM, paired with the Multidimensional Accumulated Semantic Map (MASMap) and a Dual-Tempo Action Generator that balances deep reasoning with efficient control. Empirical results on AI2-THOR and ProcTHOR show that the approach achieves higher perception accuracy and navigation robustness than state-of-the-art baselines while maintaining reasonable inference times. The work supports explicit task preferences and multi-subtask planning without end-to-end retraining, with practical implications for real-world service robots and virtual agents.

Abstract

In daily life, people often move through spaces to find objects that meet their needs, posing a key challenge in embodied AI. Traditional Demand-Driven Navigation (DDN) handles one need at a time but does not reflect the complexity of real-world tasks involving multiple needs and personal choices. To bridge this gap, we introduce Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN), a new benchmark for long-horizon navigation involving multiple sub-demands with explicit task preferences. To solve TP-MDDN, we propose AWMSystem, an autonomous decision-making system composed of three key modules: BreakLLM (instruction decomposition), LocateLLM (goal selection), and StatusMLLM (task monitoring). For spatial memory, we design MASMap, which combines 3D point cloud accumulation with 2D semantic mapping for accurate and efficient environmental understanding. Our Dual-Tempo action generation framework integrates zero-shot planning with policy-based fine control, and is further supported by an Adaptive Error Corrector that handles failure cases in real time. Experiments demonstrate that our approach outperforms state-of-the-art baselines in both perception accuracy and navigation robustness.
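The division of labor among the three modules can be pictured as a simple loop: decompose the instruction, pick a goal from memory, and check whether the subtask is finished. The following is a minimal sketch of that loop only, not the authors' implementation. The function names (`run_episode`, `toy_break`, `toy_locate`, `toy_status`), the dictionary used as object memory, and the string-matching stand-ins for the language models are all assumptions made for illustration; navigation itself is omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

Position = Tuple[float, float]
Choice = Tuple[str, str, Position]  # (subtask, object name, map position)


def run_episode(
    instruction: str,
    break_fn: Callable[[str], List[str]],
    locate_fn: Callable[[List[str], Dict[str, Position]], Optional[Choice]],
    status_fn: Callable[[str, str], bool],
    memory: Dict[str, Position],
    max_rounds: int = 10,
) -> List[str]:
    """Hypothetical decision loop; returns the subtasks marked as completed."""
    subtasks = break_fn(instruction)            # role of BreakLLM
    done: List[str] = []
    for _ in range(max_rounds):
        pending = [s for s in subtasks if s not in done]
        if not pending:
            break
        choice = locate_fn(pending, memory)     # role of LocateLLM
        if choice is None:
            break  # nothing in memory matches; a real agent would keep exploring
        subtask, obj, _position = choice
        # Moving the agent to _position would happen here (omitted).
        if status_fn(subtask, obj):             # role of StatusMLLM
            done.append(subtask)
    return done


# Toy stand-ins for the three modules (plain string matching, no models).
def toy_break(instruction: str) -> List[str]:
    return [part.strip() for part in instruction.split(" and ")]


def toy_locate(pending: List[str], memory: Dict[str, Position]) -> Optional[Choice]:
    for subtask in pending:
        for obj, pos in memory.items():
            if obj in subtask:
                return subtask, obj, pos
    return None


def toy_status(subtask: str, obj: str) -> bool:
    return obj in subtask


memory = {"cup": (1.0, 2.0), "sofa": (3.0, 4.0)}
completed = run_episode("find a cup and sit on the sofa",
                        toy_break, toy_locate, toy_status, memory)
```

Passing the three modules in as callables keeps the loop independent of any particular model, which mirrors the claim that the system does not need end-to-end retraining.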

Paper Structure

This paper contains 12 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: This figure presents our autonomous decision-making process and its performance benefits. (a) Overview: The pink area shows subtask selection using landmark semantic memory; the yellow area explains Dual-Tempo action generation via generalist and specialist policies; the green area details dynamic subtask completion checks. (b) Performance: Our method achieves a $16\%$ higher success rate than DDN and InstructNav under the TP-MDDN benchmark, with superior efficiency and individual success rates, highlighting its effectiveness and reliability.
  • Figure 2: Overview. This diagram illustrates the dual-tempo action generation process in our system. The BreakLLM module decomposes the instruction. Then, depth images are converted into 2D semantic points. The fast-tempo branch uses a pretrained policy to generate primitive actions, while the slow-tempo branch employs LocateLLM for high-level navigation reasoning, determining target objects and positions. StatusMLLM tracks task progress and updates memory. The Navigation Network performs affordance map computation, adaptive error correction, and waypoint prediction.
  • Figure 3: Foundation Model Usage. BreakLLM decomposes the instruction. The agent uses Ram-Grounded-Sam to segment panoramic RGB-D images and then maps the segments onto 2D semantic maps to form object memory. LocateLLM receives multiple types of data and outputs the next target object and position. StatusMLLM determines whether a subtask has been completed based on the currently observed image. The Adaptive Error Corrector re-plans the failed trajectory.
  • Figure 4: Visualization Results. The intelligent agent receives a Task-Preferenced Multi-Demand-Driven instruction, autonomously decomposes it into multiple subtasks, and identifies objects in the scene that match the unexecuted subtasks. On the affordance maps, redder values indicate higher affordance scores. The arrow in the waypoint predictor graph represents the agent's location and current field of view. As the step count increases, the three subtasks are gradually completed.
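
The captions mention converting depth images into 2D semantic points. One common way to do this is pinhole back-projection followed by binning onto a top-down grid, sketched below. This is an illustrative assumption rather than the paper's MASMap procedure: the function name `depth_to_semantic_grid`, the intrinsics `fx` and `cx`, the cell size, and the majority-vote labelling are all hypothetical. The sketch also works in the camera frame only, ignoring agent pose and height.

```python
import math
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (lateral index, forward index) on a top-down grid


def depth_to_semantic_grid(
    depth: List[List[float]],
    labels: List[List[Optional[str]]],
    fx: float,
    cx: float,
    cell_size: float = 0.5,
) -> Dict[Cell, str]:
    """Project labelled depth pixels onto a 2D grid in the camera frame.

    depth[v][u] is the distance along the optical axis in metres;
    labels[v][u] is a class name, or None for unlabelled pixels.
    """
    votes: Dict[Cell, Dict[str, int]] = {}
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            label = labels[v][u]
            if label is None or z <= 0:
                continue  # skip unlabelled pixels and invalid depth
            x = (u - cx) * z / fx  # pinhole model: lateral offset in metres
            cell = (math.floor(x / cell_size), math.floor(z / cell_size))
            counts = votes.setdefault(cell, {})
            counts[label] = counts.get(label, 0) + 1
    # Each cell keeps the label seen most often among its pixels.
    return {cell: max(counts, key=counts.get) for cell, counts in votes.items()}


# One-row toy image: a cup to the left at 1 m, a sofa to the right at 2 m.
grid = depth_to_semantic_grid(
    depth=[[1.0, 1.0, 2.0]],
    labels=[["cup", None, "sofa"]],
    fx=1.0,
    cx=1.0,
)
```

A full pipeline would additionally transform each point by the agent's pose before binning, so that observations from different steps accumulate in one shared map.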