Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation

Badi Li, Ren-jie Lu, Yu Zhou, Jingke Meng, Wei-shi Zheng

TL;DR

GOAL tackles Object Goal Navigation (ObjectNav) by modeling the uncertainty of unseen indoor layouts with a generative flow model that is trained on LLM-derived spatial priors. It distills contextual knowledge into data-dependent couplings between partial and full semantic maps, enabling the agent to imagine plausible unobserved regions and select informative long-horizon waypoints. The approach combines 3D scene understanding, scene segmentation, and flow-based completion trained via optimal-transport interpolation, achieving state-of-the-art results on Gibson and MP3D and strong cross-dataset transfer to HM3D. The results indicate that combining structured LLM guidance with flow-based sampling improves generalization for embodied agents in unseen environments, which is relevant to robust navigation in real-world robotics and AI assistants.

Abstract

The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson and shows strong generalization in transfer settings to HM3D. Code and pretrained models are available at https://github.com/Badi-Li/GOAL.
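The training idea described in the abstract can be illustrated with a small NumPy sketch. It is not taken from the GOAL repository: the function names, the `weight` and `sigma` parameters, and the max-blending rule used to inject the prior are all assumptions made for illustration. The sketch encodes a spatial prior as a two-dimensional Gaussian field, injects it into one channel of a target semantic map, and then builds the linear interpolant and velocity target that a flow model would regress on when the partial map is coupled with the enriched full map.

```python
import numpy as np


def gaussian_field(height, width, center, sigma):
    """Return an (height, width) isotropic 2D Gaussian that peaks at 1 on `center`."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = center
    sq_dist = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def inject_prior(target_map, channel, center, sigma, weight):
    """Blend a Gaussian prior into one channel of a (C, H, W) semantic map.

    Assumption: the prior is merged with an element-wise maximum so that
    observed ground-truth cells are never weakened. The paper may use a
    different rule.
    """
    enriched = target_map.astype(np.float64).copy()
    _, height, width = enriched.shape
    field = weight * gaussian_field(height, width, center, sigma)
    enriched[channel] = np.maximum(enriched[channel], field)
    return enriched


def flow_matching_pair(partial_map, full_map, t):
    """Return the straight-line interpolant x_t and its velocity target.

    With a data-dependent coupling, the source sample is the partial map
    and the target sample is the (enriched) full map.
    """
    x_t = (1.0 - t) * partial_map + t * full_map
    velocity = full_map - partial_map
    return x_t, velocity


# Toy usage: 2 semantic channels on a 5x5 grid, prior placed at the center.
partial = np.zeros((2, 5, 5))
full = inject_prior(np.zeros((2, 5, 5)), channel=1, center=(2, 2), sigma=1.0, weight=0.5)
x_t, v = flow_matching_pair(partial, full, t=0.25)
```

During training, a network would receive `x_t` and `t` and be trained to predict `v`; the straight-line path is what makes few-step sampling plausible at inference time.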

Paper Structure

This paper contains 49 sections, 21 equations, 10 figures, 8 tables, 2 algorithms.

Figures (10)

  • Figure 1: Overview of the GOAL framework, with navigation (inference) in blue and training in green. (a) shows the navigation pipeline, where the agent imagines future maps using a flow-guided model. (b) illustrates how we prompt an LLM with hierarchical instructions to generate contextual priors (for the full prompt and response, see the appendix). (c) visualizes how we use LLM priors to construct data-dependent couplings. (d) demonstrates how the flow model is trained using these couplings through interpolated velocity supervision.
  • Figure 2: Visualization of navigation with GOAL on MP3D (val). The top row shows RGB observations and agent trajectories; the bottom row displays the observed semantic maps and generated full-scene maps.
  • Figure 3: Comparison between the simulated visible area and actual visible area given the agent position (red dot). The left shows the simulated mask adopted by PONI and our work, while the right shows the actual mask, revealing a substantial gap.
  • Figure 4: Effect of the number of Euler steps $n$ on navigation performance.
  • Figure 5: Tuning curve for hyper-parameter $\epsilon$ across different LLMs.
  • ...and 5 more figures
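
Figure 4 studies the number of Euler steps $n$ used when sampling from the flow model. As a rough illustration of what that parameter controls, the sketch below integrates a velocity field from $t=0$ to $t=1$ with $n$ uniform Euler steps. The `euler_sample` function and the toy constant velocity field are assumptions for illustration only; in GOAL the velocity would come from the trained network.

```python
import numpy as np


def euler_sample(velocity_fn, x0, n_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with uniform Euler steps."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Toy usage: a constant velocity field that points from the partial map to
# the full map, standing in for a trained network (hypothetical values).
partial = np.zeros((2, 4, 4))
full = np.ones((2, 4, 4))
completed = euler_sample(lambda x, t: full - partial, partial, n_steps=4)
```

For a perfectly straight path such as this toy field, any number of steps reaches the target; a learned velocity field is only approximately straight, which is why the step count becomes a tunable trade-off between sampling cost and completion quality.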