Versatile Navigation under Partial Observability via Value-guided Diffusion Policy

Gengyu Zhang; Hao Tang; Yan Yan

TL;DR

This work tackles navigation under partial observability by combining trajectory-level diffusion planning with value guidance. A diffusion-based plan generator conditions on partial environment maps, while a QMDP value function, assisted by state estimation, guides exploration and backtracking and selects high-value trajectories, addressing dead ends and long-horizon challenges. The approach enables zero-shot transfer from 2D to 3D through semantic BEV projection of RGB-D data, gains further from RGB-D retraining, and achieves state-of-the-art or competitive results on GridMaze2D and the AVD benchmark. The results demonstrate robust performance and adaptability across 2D and 3D navigation tasks, highlighting the method's potential for real-world autonomous systems operating under partial observability.

Abstract

Route planning for navigation under partial observability plays a crucial role in modern robotics and autonomous driving. Existing route planning approaches fall into two main classes: traditional autoregressive methods and diffusion-based methods. The former often fail due to their myopic nature, while the latter either assume full observability or struggle to adapt to unfamiliar scenarios because they are strongly coupled with behavior cloning from experts. To address these deficiencies, we propose a versatile diffusion-based approach for both 2D and 3D route planning under partial observability. Specifically, our value-guided diffusion policy first generates plans that predict actions across multiple timesteps, giving the planner ample foresight. It then employs a differentiable planner with state estimation to derive a value function, which directs the agent's exploration and goal-seeking behaviors without relying on experts while explicitly addressing partial observability. During inference, our policy is further enhanced by a best-plan-selection strategy, substantially boosting the planning success rate. Moreover, we propose projecting point clouds, derived from RGB-D inputs, onto 2D grid-based bird's-eye-view maps via semantic segmentation, which generalizes the approach to 3D environments. This simple yet effective adaptation enables zero-shot transfer of a 2D-trained policy to 3D, bypassing the laborious training of a 3D policy and thus demonstrating the versatility of our approach. Experimental results demonstrate our superior performance, particularly in situations beyond expert demonstrations, surpassing state-of-the-art autoregressive and diffusion-based baselines in both 2D and 3D scenarios.
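The projection of a semantically labeled point cloud onto a 2D bird's-eye-view grid, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_to_bev`, the cell codes, the `"floor"` label, the axis convention (x and z span the ground plane, y is height), and the rule that any non-floor point marks its cell as an obstacle are all assumptions made for illustration.

```python
# Hypothetical sketch of point-cloud-to-BEV projection (names and rules are assumptions).
FREE, OBSTACLE, UNKNOWN = 0, 1, -1


def project_to_bev(points, grid_size, cell_size, floor_label="floor"):
    """Project labeled 3D points onto a square BEV occupancy grid.

    points: iterable of (x, y, z, label); x and z are ground-plane
            coordinates in meters, y is height (ignored here).
    Returns a grid_size x grid_size list of lists with FREE, OBSTACLE,
    or UNKNOWN in each cell.
    """
    grid = [[UNKNOWN] * grid_size for _ in range(grid_size)]
    for x, _y, z, label in points:
        col = int(x // cell_size)
        row = int(z // cell_size)
        # Skip points that fall outside the mapped area.
        if not (0 <= row < grid_size and 0 <= col < grid_size):
            continue
        if label != floor_label:
            # Any non-floor point makes the cell an obstacle.
            grid[row][col] = OBSTACLE
        elif grid[row][col] == UNKNOWN:
            # Floor points mark a cell free only if nothing occupies it yet.
            grid[row][col] = FREE
    return grid


if __name__ == "__main__":
    cloud = [
        (0.1, 0.0, 0.1, "floor"),
        (0.6, 0.5, 0.1, "chair"),
        (0.6, 0.0, 0.1, "floor"),
        (5.0, 0.0, 5.0, "wall"),  # outside the 2x2 grid, ignored
    ]
    for row in project_to_bev(cloud, grid_size=2, cell_size=0.5):
        print(row)
```

In this sketch an obstacle label always takes precedence over a floor label within the same cell, so a chair standing on the floor yields an obstacle cell regardless of the order in which points arrive.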

Paper Structure (16 sections, 10 equations, 7 figures, 6 tables, 1 algorithm)


Introduction
Related Work and Preliminary
Methodology
Problem Formulation
Diffusion-model-based Plan Generation
Value-guided Exploration-safe Planning
2D to 3D Policy Transfer
Experiments
Task Setups
Result Analysis
Ablation Study
Conclusion
Network Architectural and Experimental Specification
Additional Qualitative Results
Additional Quantitative Results
...and 1 more section

Figures (7)

Figure 1: Our value-guided diffusion policy under partial observability. It processes local partial observations to generate action sequences adaptable for both 2D and 3D scenarios.
Figure 2: The architecture of the diffusion-model-based plan generator. The top sequence represents local observations over time. The grids in the middle form the sequence of cumulative partial maps, which encapsulate the agent's long-term memory and environment features. The bottom sequence represents the generated plan in the form of action trajectories. During training, the input of the framework at timestep $t$ consists of the partial map, $\bm{e}_{(t)}$, and the expert action trajectory, ${\bm{\tau}_a}_{(t)}$; during inference, the input comprises $\bm{e}_{(t)}$ and Gaussian noise of the same shape as ${\bm{\tau}_a}_{(t)}$.
Figure 3: Reward function conditioned on the partial environmental map. The model learns a valid action mask that filters out invalid actions using soft thresholding. This learned embedding is subsequently used to construct the reward function.
Figure 4: QMDP value iteration module. The learned reward function undergoes $K$ rounds of iterations, consisting of alternating maximization over actions and convolution with the transition function $\hat{T}_m$. The outcome, soft-indexed by the current belief, derives the final action values of this QMDP planner.
Figure 5: An illustration of constructing a point cloud for a given scene and its subsequent projection onto a BEV map. In this specific example, objects such as the table, chair, and various other furniture pieces in the kitchen, the two sofas and television cabinets in the living room, and the surrounding walls are identified as obstacles on the BEV map. Conversely, areas of the floor that remain uncovered by any objects are designated as free space.
...and 2 more figures
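The QMDP value iteration summarized in the Figure 4 caption can be illustrated with a small tabular sketch. It follows the standard QMDP recipe (K rounds of a Bellman backup, then weighting the state-action values by the belief) rather than the paper's learned, convolutional module. The function name `qmdp_action_values`, the tabular `reward` and `transition` arrays, and the discount factor `gamma` are assumptions for illustration only.

```python
# Hypothetical tabular QMDP sketch (not the paper's differentiable module).
def qmdp_action_values(reward, transition, belief, num_iters, gamma=0.95):
    """Return Q(b, a) for each action under the QMDP approximation.

    reward[s][a]         : immediate reward for action a in state s
    transition[a][s][s2] : probability of reaching s2 from s under a
    belief[s]            : current probability of being in state s
    num_iters            : number of value-iteration rounds (K)
    """
    n_states = len(reward)
    n_actions = len(reward[0])
    value = [0.0] * n_states
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(num_iters):
        for s in range(n_states):
            for a in range(n_actions):
                # Expected next-state value under the transition model.
                expected_next = sum(
                    transition[a][s][s2] * value[s2] for s2 in range(n_states)
                )
                q[s][a] = reward[s][a] + gamma * expected_next
        # Maximize over actions to obtain the state values for the next round.
        value = [max(row) for row in q]
    # Weight the state-action values by the belief over states.
    return [
        sum(belief[s] * q[s][a] for s in range(n_states))
        for a in range(n_actions)
    ]


if __name__ == "__main__":
    # Three-state corridor: state 2 is an absorbing goal; actions are left (0), right (1).
    left = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
    right = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
    reward = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # reward for stepping into the goal
    q_b = qmdp_action_values(reward, [left, right], [0.5, 0.5, 0.0], 3, gamma=0.9)
    print(q_b)
```

With the agent equally unsure whether it is in state 0 or state 1, moving right receives the higher belief-weighted value, which is the behavior one would expect from a goal-seeking planner in this toy corridor.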
