To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation

Apoorva Vashisth, Manav Kulshrestha, Pranav Bakshi, Damon Conover, Guillaume Sartoretti, Aniket Bera

TL;DR

This paper proposes an LLM-driven, constraint-based planning framework with active perception. The LLM reasons over a structured scene graph of discovered objects and obstacles to decide which object to move, where to place it, and where to look next to discover task-relevant information.

Abstract

Visual navigation typically assumes the existence of at least one obstacle-free path between start and goal, which must be discovered or planned by the robot. However, in real-world scenarios, such as home environments and warehouses, clutter can block all routes. Targeted at such cases, we introduce the Lifelong Interactive Navigation problem, where a mobile robot with manipulation abilities can move clutter to forge its own path to complete sequential object-placement tasks, each involving placing a given object (e.g., Alarm clock, Pillow) onto a target object (e.g., Dining table, Desk, Bed). To address this lifelong setting, where the effects of environment changes accumulate over time, we propose an LLM-driven, constraint-based planning framework with active perception. Our framework allows the LLM to reason over a structured scene graph of discovered objects and obstacles, deciding which object to move, where to place it, and where to look next to discover task-relevant information. This coupling of reasoning and active perception allows the agent to explore the regions expected to contribute to task completion rather than exhaustively mapping the environment. A standard motion planner then executes the corresponding navigate-pick-place or detour sequence, ensuring reliable low-level control. Evaluated in the physics-enabled ProcTHOR-10k simulator, our approach outperforms non-learning and learning-based baselines. We further demonstrate our approach qualitatively on real-world hardware.
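
The scene graph the abstract describes (objects as nodes, blocking relations as edges) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the ObjectNode and SceneGraph classes, the move_cost attribute, and the choose_action rule are hypothetical and do not come from the paper. In particular, the paper delegates the move-or-detour decision to an LLM; the fixed cost comparison below is only a stand-in that shows why the number of remaining tasks matters in a lifelong setting.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str
    movable: bool      # hypothetical manipulability attribute
    move_cost: float   # hypothetical one-time cost of relocating the object


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    # obstacle name -> set of object names whose access it blocks
    blocks: dict = field(default_factory=dict)

    def add_object(self, node):
        self.nodes[node.name] = node
        self.blocks.setdefault(node.name, set())

    def add_blocking(self, obstacle, target):
        self.blocks[obstacle].add(target)

    def blockers_of(self, target):
        return sorted(o for o, ts in self.blocks.items() if target in ts)


def choose_action(graph, target, detour_cost, remaining_tasks):
    """Toy stand-in for the LLM decision (an assumption, not the paper's method).

    Move the cheapest movable blocker when its one-time cost is lower than the
    detour cost paid again on every remaining task; otherwise detour.
    """
    movable = [graph.nodes[o] for o in graph.blockers_of(target)
               if graph.nodes[o].movable]
    if not movable:
        return ("detour", None)
    cheapest = min(movable, key=lambda n: n.move_cost)
    if cheapest.move_cost < detour_cost * remaining_tasks:
        return ("move", cheapest.name)
    return ("detour", None)


graph = SceneGraph()
graph.add_object(ObjectNode("Desk2", movable=False, move_cost=0.0))
graph.add_object(ObjectNode("PaperTowelRoll", movable=True, move_cost=5.0))
graph.add_blocking("PaperTowelRoll", "Desk2")

early = choose_action(graph, "Desk2", detour_cost=2.0, remaining_tasks=16)
late = choose_action(graph, "Desk2", detour_cost=2.0, remaining_tasks=1)
```

With these made-up costs, the rule relocates the paper towel roll early in the episode, when many tasks remain, and prefers a detour on the last task. The assumption that the same detour cost recurs on every remaining task is a deliberate simplification.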

Paper Structure (31 sections, 9 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Demonstration of our approach deployed on the Boston Dynamics Spot robot. The left image shows the region of the environment that is the robot's operating area for the current task, initially unknown. The robot first observes the scene and determines the presence of task-relevant objects (Bottle and Desk in this case). Based on the observations and the current task progress, the robot determines the best course of action ("Bring RedBottle to Desk2"). Here, based on the current episode progress ($4$ of $20$ tasks completed) and a cost-benefit analysis (moving potential clutter vs. directly completing the task), it decides to first optimize the environment by moving clutter (the paper towel roll) and storing it at a carefully chosen spot out of the way (the black box). The robot is then able to place the red bottle on the desk.
  • Figure 2: Overview of our constraint-based planning framework for interactive navigation. At each timestep, our agent receives observations from the environment, which are utilized by our perception module to update the scene graph. Our scene graph contains the observed objects of the environment as nodes and the blocking relation among them as edges. Each node has its own attributes, which provide additional navigability and manipulability context to the LLM. The LLM then decides whether to explore the environment further to collect additional information or to attempt to complete the given task based on the scene graph content and the environment constraints.
  • Figure 3: Example dataset generation process. In this instance, we consider a $10$-room floorplan from ProcTHOR-10k. The left image shows the ideal, clutter-free environment, in which navigation is unobstructed. We then generate clutter in the free space using the betweenness centrality of nodes in $G_{\text{free}}$. The generated obstacle locations are indicated by red circles in the middle image. The clutter blocks paths within the accessible rooms and cuts off access to the two rooms at the bottom right of the map, as shown in the right image.
  • Figure 4: Lifetime Efficiency Score (LES) plot for floorplans ranging from $1$ to $10$ rooms. Higher is better. In small environments, nearly all baselines appear competitive because suboptimal actions -- such as excessive detours (Detour) or indiscriminate manipulation (Interact, Cleanup) -- incur limited long-term cost. As spatial complexity increases, these naïve policies collapse: detour-heavy strategies suffer from compounding path inefficiency, while cleanup-heavy strategies incur large manipulation overhead. InterNav fails to scale due to brittle obstacle-handling assumptions. Our method remains consistently stable across environment sizes. The widening performance gap highlights the need for selective, globally informed interaction when navigating lifelong, cluttered environments.
  • Figure 5: Point-cloud–driven environment mapping during real-world deployment. We visualize the evolution of the robot’s perception and its derived navigation graph during a real episode on the Boston Dynamics Spot. (a) The initial RGB-D observations from the front fisheye cameras produce a sparse and noisy point cloud reflecting partial scene coverage. (b) From this point cloud, the system constructs an initial navigation grid graph by projecting points onto the 2D floor plane and classifying nodes as free (gray) or occupied (red) based on local point density and presence of obstacles. (c) As the robot actively explores and executes navigation and manipulation actions, the accumulated point cloud becomes significantly more complete, revealing previously unseen regions and freed regions due to obstacle relocation. (d) The final grid graph incorporates these expanded observations, yielding a more accurate and globally connected representation of traversable space and obstacles. This perception-mapping loop provides the structured scene graph used by our LLM planner to reason about exploration, obstacle relocation, and long-horizon navigation decisions.
  • ...and 3 more figures
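
The clutter generation step in the Figure 3 caption, which places obstacles using the betweenness centrality of nodes in $G_{\text{free}}$, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the free space is modeled as a 4-connected grid, centrality is computed with Brandes' algorithm for unweighted graphs, and picking the top-k nodes is an assumed selection rule (the caption does not say how nodes are chosen from the centrality scores). The function names and the tiny two-room map are hypothetical.

```python
from collections import deque


def grid_graph(free_cells):
    """4-connected adjacency over free grid cells (a stand-in for G_free)."""
    cells = set(free_cells)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(x, y): [(x + dx, y + dy) for dx, dy in steps
                     if (x + dx, y + dy) in cells]
            for (x, y) in cells}


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes) for an undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}     # shortest-path predecessors
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):        # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: c / 2 for v, c in bc.items()}  # each unordered pair is counted twice


def place_clutter(adj, k):
    """Assumed rule: put obstacles on the k most central free cells."""
    scores = betweenness(adj)
    return sorted(adj, key=lambda v: (-scores[v], v))[:k]


def reachable(adj, start, blocked):
    """Cells reachable from start when the blocked cells are impassable."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return seen


# Two 2x2 rooms joined by a single doorway cell at (2, 0).
free = [(x, y) for x in (0, 1, 3, 4) for y in (0, 1)] + [(2, 0)]
adj = grid_graph(free)
clutter = place_clutter(adj, k=1)
cut_off = len(adj) - len(reachable(adj, (0, 0), set(clutter)))
```

On this toy map the doorway cell has the highest centrality, so a single obstacle there leaves only the four cells of the first room reachable from (0, 0). This mirrors, at small scale, the loss of access to whole rooms that the caption describes.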