OctoNav: Towards Generalist Embodied Navigation

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, Si Liu

TL;DR

This paper introduces OctoNav-Bench, a large-scale benchmark unifying multiple embodied navigation tasks under free-form, multi-modal instructions, and OctoNav-R1, a VLA-based agent trained with a Hybrid Training Paradigm to generate low-level actions from 2D observations. A Think-Before-Action CoT dataset and three-stage training (Action-/TBA-SFT, Nav-GRPO, Online RL) are designed to enhance deliberative reasoning and generalization across ObjNav, PointNav, ImgNav, Ins-ImgNav, and VLN. Experiments in continuous Habitat environments show OctoNav-R1 achieving superior performance versus prior methods and demonstrate preliminary sim2real capabilities on real robots. The work emphasizes explicit thinking in navigation policies and lays groundwork for robust, generalist embodied agents.

Abstract

Embodied navigation is a foundational pillar of the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav, and VLN, which differ in task objectives and modalities, so datasets and methods have been designed individually. In this work, we take steps toward generalist navigation agents that can follow free-form instructions combining arbitrary modalities and capabilities. To achieve this, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We carefully craft instruction-trajectory pairs, where instructions are diverse, free-form, and span arbitrary modalities and capabilities. We also construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. We build OctoNav-R1 upon MLLMs and adapt it into a VLA-type model that produces low-level actions based solely on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL. Each stage contains specifically designed learning policies and rewards. Importantly, the TBA-SFT and Nav-GRPO designs are inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer. Thus, we investigate how to achieve thinking-before-action in embodied navigation, to improve the model's reasoning ability toward generalist agents. Specifically, we propose TBA-SFT, which uses the TBA-CoT dataset to fine-tune the model as a cold-start phase, and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
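
As background for the Nav-GRPO stage named in the abstract, the sketch below illustrates the group-relative advantage that GRPO-style methods generally use: several responses are sampled for the same input, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration, not the paper's implementation. The function name, the use of the population standard deviation, the `eps` constant, and the example reward values are all assumptions, and the navigation-specific rewards of Nav-GRPO are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt
             (hypothetical values; the paper's reward design is not shown).
    eps:     small constant (an assumption) that avoids division by zero
             when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled responses, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses scoring above their group's mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.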

Paper Structure

This paper contains 40 sections, 25 equations, 25 figures, 8 tables.

Figures (25)

  • Figure 1: On the left, we present the large-scale OctoNav-Bench, which contains diverse instruction-trajectory pairs and the elaborate TBA-CoT dataset across numerous scenes. Based on OctoNav-Bench and our method/training designs, we introduce a VLA-based method, termed OctoNav-R1. On the right, (I) shows performance comparisons on OctoNav-Bench, with a fine-grained breakdown of accuracy across navigation capabilities. OctoNav-R1 outperforms previous methods in all capabilities, demonstrating its versatility. (II) presents a real-world robot demo driven by OctoNav-R1, showing its preliminary sim2real generalization.
  • Figure 2: The automatic construction pipeline of OctoNav-Bench. (I) Template Generation. We generate diverse instruction templates, where multiple capabilities are involved and specific elements are represented via placeholders. (II) Trajectory Generation and Instruction Instantiation. We extract elements along the sampled trajectory and instantiate the instruction by grounding the placeholders with corresponding elements. (III) Instruction Extension. We extend instructions with their variants. (IV) Quality Check. We apply automatic and manual verification stages. Best viewed in color.
  • Figure 3: The automatic construction method of TBA-CoT. For the trajectories in OctoNav-Bench, we leverage Qwen-VL and DeepSeek-R1 to produce the reasoning behind the action steps.
  • Figure 4: Overview of the HTP for training OctoNav-R1. The model takes a multi-modal instruction and visual observations as inputs and produces textual answers; model architecture details are in the appendix. HTP contains three training stages, which are described in the method section.
  • Figure 5: Visualization of TBA in a trajectory.
  • ...and 20 more figures
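
The Figure 2 caption describes instruction templates whose placeholders are grounded with elements extracted along a sampled trajectory. A minimal sketch of that instantiation step is given below, assuming Python format-string placeholders. The template text, placeholder names, element values, and the `instantiate` helper are all hypothetical and are not taken from the benchmark's actual pipeline.

```python
from string import Formatter


def instantiate(template, elements):
    """Fill every placeholder in a template with a trajectory element.

    template: instruction text with {name} placeholders (hypothetical format).
    elements: mapping from placeholder name to the grounded element.
    Raises KeyError if a placeholder has no matching element, so that an
    incomplete instruction is never emitted silently.
    """
    # Formatter().parse yields (literal, field_name, spec, conversion) tuples.
    needed = {field for _, field, _, _ in Formatter().parse(template) if field}
    missing = needed - elements.keys()
    if missing:
        raise KeyError(f"no element for placeholders: {sorted(missing)}")
    return template.format(**elements)


# Hypothetical template mixing an object goal and a point goal.
template = "Go to the {object}, then reach point {point}."
elements = {"object": "sofa", "point": "(1.5, -2.0)"}
print(instantiate(template, elements))
```

Failing loudly on an ungrounded placeholder is one simple way to support an automatic quality check of the kind the caption lists as step (IV).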