CoFL: Continuous Flow Fields for Language-Conditioned Navigation

Haokun Liu, Zhaoqi Ma, Yicheng Chen, Masaki Kitagawa, Wentao Zhang, Jinjie Li, Moju Zhao

TL;DR

CoFL, an end-to-end policy that directly maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation, significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes.

Abstract

Language-conditioned navigation pipelines often rely on brittle modular components or costly action-sequence generation. To address these limitations, we present CoFL, an end-to-end policy that directly maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. Instead of predicting discrete action tokens or sampling action chunks via iterative denoising, CoFL outputs instantaneous velocities that can be queried at arbitrary 2D projected locations. Trajectories are obtained by numerical integration of the predicted field, producing smooth motion that remains reactive under closed-loop execution. To enable large-scale training, we build a dataset of over 500k BEV image-instruction pairs, each procedurally annotated with a flow field and a trajectory derived from BEV semantic maps built on Matterport3D and ScanNet. By training on a mixed distribution, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes. Finally, we deploy CoFL zero-shot in real-world experiments with overhead BEV observations across multiple layouts, maintaining reliable closed-loop control and a high success rate.
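
The rollout described in the abstract (integrating the predicted field to obtain a trajectory) can be illustrated with a minimal sketch. It assumes forward Euler integration with a fixed step and uses a hypothetical hand-written field, `attractor_field`, in place of the learned network. The abstract does not specify the integration scheme, step size, or stopping rule, so `step`, `max_steps`, `tol`, and `GOAL` are illustrative choices rather than the authors' settings.

```python
import math

GOAL = (0.8, 0.8)  # hypothetical goal in normalized BEV coordinates


def attractor_field(x, y):
    """Toy stand-in for the learned flow field: velocity points at GOAL."""
    return GOAL[0] - x, GOAL[1] - y


def rollout(field, start, step=0.1, max_steps=200, tol=1e-3):
    """Integrate a velocity field with forward Euler (an assumed scheme).

    The field is re-queried at every new position, so the path follows
    whatever the field returns at the current location.
    """
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        vx, vy = field(x, y)
        if math.hypot(vx, vy) < tol:  # near-zero velocity: treat as arrived
            break
        x, y = x + step * vx, y + step * vy
        path.append((x, y))
    return path


path = rollout(attractor_field, start=(0.1, 0.1))
end_x, end_y = path[-1]
```

In this sketch, a smaller `step` or a higher-order integrator would spend more field queries per trajectory in exchange for a more accurate path, which is one plausible way to trade compute for rollout quality.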

Paper Structure

This paper contains 101 sections, 42 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of the main contributions. (a) An automated annotation pipeline that constructs a large-scale BEV image-instruction dataset with procedurally generated flow-field and trajectory references (500k+ samples) from indoor 3D scenes. (b) CoFL: a transformer-based language-conditioned policy that predicts continuous flow fields, enabling smooth trajectory rollout with a controllable inference budget. (c) Benchmark results on unseen environments, demonstrating more precise and safer navigation. (d) Zero-shot transfer to real-world robot navigation using a model trained only on the proposed dataset.
  • Figure 2: Overview of CoFL's network architecture. Given an RGB BEV observation $I$ and a language instruction $\ell$, a SigLIP 2-based [zhai2023sigmoid, tschannen2025siglip2] vision-language encoder produces language-conditioned context tokens over the BEV image. The decoder then queries this context with 2D normalized spatial coordinates $\mathbf{X}$ and outputs the corresponding velocities $\mathbf{V}$, forming a continuous flow field $\mathbf{v}(\mathbf{x} \mid I, \ell)$ via bilinear interpolation of discrete velocity predictions.
  • Figure 3: Overview of the trajectory inference. The predicted flow field over the workspace guides agents from different starts toward the same goal while smoothly avoiding obstacles.
  • Figure 4: Overview of the proposed visual observation generation pipeline. Images are captured from a multi-view camera array. (a)-(b) Top-down view RGB and semantic segmentation maps. (c)-(d) Oblique view sample.
  • Figure 5: Overview of the procedural annotation pipeline. We combine cost-weighted geodesic distance and obstacle repulsion to derive the field and trajectory annotations.
  • ...and 6 more figures
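
The Figure 2 caption mentions that the continuous field is formed by bilinear interpolation of discrete velocity predictions. A minimal sketch of such a query is given below. It assumes coordinates normalized to [0, 1] and a row-major grid of `(vx, vy)` tuples; the function name `bilinear_query`, the grid layout, and the clamping at the border are illustrative assumptions, not details taken from the paper.

```python
def bilinear_query(grid, x, y):
    """Interpolate a velocity from a discrete grid at a continuous point.

    grid[row][col] holds a (vx, vy) tuple. x runs along columns and y along
    rows, both assumed to be normalized to [0, 1] (values outside are clamped).
    """
    rows, cols = len(grid), len(grid[0])

    # Map normalized coordinates to continuous grid indices.
    gx = min(max(x, 0.0), 1.0) * (cols - 1)
    gy = min(max(y, 0.0), 1.0) * (rows - 1)

    # Indices of the four surrounding grid nodes, kept inside the grid.
    c0, r0 = int(gx), int(gy)
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    tx, ty = gx - c0, gy - r0  # fractional offsets inside the cell

    def lerp(a, b, t):
        return a + (b - a) * t

    velocity = []
    for k in range(2):  # interpolate vx and vy independently
        top = lerp(grid[r0][c0][k], grid[r0][c1][k], tx)
        bottom = lerp(grid[r1][c0][k], grid[r1][c1][k], tx)
        velocity.append(lerp(top, bottom, ty))
    return tuple(velocity)


# Hypothetical 2x2 grid of predicted velocities.
demo_grid = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
]
center_velocity = bilinear_query(demo_grid, 0.5, 0.5)
```

Because the interpolated value varies continuously with `(x, y)`, a query like this can be evaluated at any position reached during trajectory integration, not only at grid nodes.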