CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model

Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, Hao Dong

TL;DR

CorrectNav introduces Self-correction Flywheel, a post-training paradigm that treats training-time navigation errors as valuable data to automatically generate self-correction samples. By detecting trajectory deviations, creating error-correcting trajectories and keyframe perception data, and iteratively retraining, the approach achieves state-of-the-art results on VLN-CE benchmarks with monocular RGB input. Real-world robot tests demonstrate robust error correction, obstacle avoidance, and long-instruction following, highlighting practical impact for embodied navigation. The work also provides a suite of navigation fine-tuning techniques and illustrates meaningful ablations and iterative improvements across flywheel cycles.

Abstract

Existing vision-and-language navigation models often deviate from the correct trajectory when executing instructions. However, these models lack effective error correction capability, hindering their recovery from errors. To address this challenge, we propose Self-correction Flywheel, a novel post-training paradigm. Instead of considering the model's error trajectories on the training set as a drawback, our paradigm emphasizes their significance as a valuable data source. We have developed a method to identify deviations in these error trajectories and devised innovative techniques to automatically generate self-correction data for perception and action. These self-correction data serve as fuel to power the model's continued training. The brilliance of our paradigm is revealed when we re-evaluate the model on the training set, uncovering new error trajectories. At this time, the self-correction flywheel begins to spin. Through multiple flywheel iterations, we progressively enhance our monocular RGB-based VLA navigation model CorrectNav. Experiments on R2R-CE and RxR-CE benchmarks show CorrectNav achieves new state-of-the-art success rates of 65.1% and 69.3%, surpassing prior best VLA navigation models by 8.2% and 16.4%. Real robot tests in various indoor and outdoor environments demonstrate CorrectNav's superior capabilities in error correction, dynamic obstacle avoidance, and long-instruction following.
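
The loop described in the abstract (evaluate on the training set, detect deviations, create self-correction data, continue training, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the distance-threshold deviation test, the 2D positions, and the names `find_deviation_step`, `flywheel`, `rollout_fn`, and `train_fn` are all assumptions introduced here, and the sample format stands in for the error-correcting trajectory and keyframe perception data that the paper generates.

```python
import math
from typing import Any, Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Episode = Tuple[str, List[Point]]  # (instruction, reference trajectory)


def point_to_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def distance_to_path(p: Point, path: Sequence[Point]) -> float:
    """Distance from p to the nearest point on a polyline."""
    if len(path) == 1:
        return math.hypot(p[0] - path[0][0], p[1] - path[0][1])
    return min(point_to_segment_distance(p, a, b) for a, b in zip(path, path[1:]))


def find_deviation_step(
    rollout: Sequence[Point], reference: Sequence[Point], threshold: float
) -> Optional[int]:
    """Index of the first rollout step farther than `threshold` from the
    reference trajectory, or None if the rollout never deviates.
    (Assumed criterion; the paper's exact rule is not given in this section.)"""
    for step, position in enumerate(rollout):
        if distance_to_path(position, reference) > threshold:
            return step
    return None


def flywheel(
    model: Any,
    episodes: Sequence[Episode],
    rollout_fn: Callable[[Any, str], List[Point]],  # hypothetical: runs the model
    train_fn: Callable[[Any, list], Any],           # hypothetical: continued training
    threshold: float,
    iterations: int,
) -> Any:
    """Evaluate on training episodes, collect deviations, retrain, repeat."""
    for _ in range(iterations):
        samples = []
        for instruction, reference in episodes:
            rollout = rollout_fn(model, instruction)
            step = find_deviation_step(rollout, reference, threshold)
            if step is not None:
                # Keep the rollout up to the deviation plus the reference path,
                # from which correction data could be derived.
                samples.append((instruction, rollout[: step + 1], reference))
        if not samples:
            break  # no new error trajectories, so the flywheel stops
        model = train_fn(model, samples)
    return model
```

Each pass re-evaluates the updated model on the same training episodes, so the deviations collected in one iteration differ from those in the previous one; that is the sense in which the flywheel keeps producing fresh training signal until no deviations remain or the iteration budget is spent.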

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Diverse Capabilities of CorrectNav. The model takes only monocular RGB video and language instructions as inputs, predicting navigation actions. Empowered by the Self-correction Flywheel post-training, CorrectNav not only maintains outstanding multimodal reasoning (Blue), but also displays improved deviation correction (Red), obstacle avoidance (Green), and complex action execution (Yellow).
  • Figure 2: The overview of CorrectNav training. CorrectNav is first finetuned on the navigation tasks (Left), including action prediction and instruction generation. To enhance vision diversity, we implement a suite of domain randomization strategies. Subsequently, CorrectNav is post-trained with our proposed Self-correction Flywheel paradigm (Right). This paradigm operates in a continuous loop of model evaluation, deviation detection, data creation, and continued training. Specifically, the data creation part can automatically collect error-correcting trajectory and keyframe perception data. Through multiple training iterations, CorrectNav can learn how to recover from deviations.
  • Figure 3: Case study about CorrectNav with and without Self-correction Flywheel post-training. Left Top: CorrectNav mistakenly enters the wrong path, loses the target, and then promptly turns back to return to the correct path. Right Top: CorrectNav first enters the front door, and after realizing there is no target (steps), it leaves and directly enters the correct side door. Vanilla CorrectNav fails in both cases.
  • Figure 4: CorrectNav's performance on R2R-CE and RxR-CE Val-Unseen splits over Self-correction Flywheel iterations.
  • Figure 5: Qualitative results from the real-world deployment of CorrectNav. (c)(d) The robot dynamically avoids pedestrians and obstacles, correctly passing through cluttered environments to reach the destination. (e)(f) The robot successfully recovers from a navigation error to complete a long-horizon instruction. (g) The robot completes outdoor long-distance navigation. Videos are shown on our project website.