LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward

Yi Zhao, Siqi Wang, Jing Li

TL;DR

This work tackles navigation instruction generation for visually impaired (VI) users by introducing LaF-GRPO, a GRPO-based training framework that uses an LLM-as-Follower reward to guide Vision-Language Model post-training. It also introduces NIG4VI, a 27k-sample open-source benchmark with multi-modal, in-situ navigation contexts, enabling realistic evaluation. Across zero-shot and supervised-finetuning settings, LaF-GRPO delivers gains in objective metrics and produces more intuitive, safety-aware instructions, as corroborated by human and LLM-based judges. The study demonstrates the practical value of simulated follower feedback for VI navigation and provides a benchmark to advance this domain further.

Abstract

Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study focuses on generating precise, in-situ, step-by-step navigation instructions that are practically usable for VI users. Specifically, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to navigation instructions, thereby providing feedback rewards to guide the post-training of a Vision-Language Model (VLM). This enhances instruction accuracy and usability while reducing costly real-world data collection needs. To address the scarcity of dedicated benchmarks in this field, we introduce NIG4VI, a 27k-sample open-source dataset to facilitate training and evaluation. It comprises diverse navigation scenarios with accurate spatial coordinates, supporting detailed and open-ended in-situ instruction generation. Experiments on NIG4VI demonstrate the effectiveness of LaF-GRPO through quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU 14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o 0.323), and qualitative analysis further confirms that our method yields more intuitive and safer instructions.

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: NIG4VI sample. (a) Bird’s-eye map and front views. (b) Instructions generated by ASSISTER (Huang et al., 2022), the ground truth, and LaF-GRPO (Ours).
  • Figure 2: Method Overview. Top left: A training sample with input, target output, and action interpretation. Bottom left: Action interpreter training using LLaMA-3-8B-Instruct to simulate VI users' navigation responses. Right: Post-training for VLMs with LaF-GRPO. The LaF reward differentiates outputs when Format and Text Generation rewards are identical.
  • Figure 3: Dataset instances are first generated using Vision-R1's modality bridging method (Huang et al., 2025) and then reviewed and refined by human annotators.
  • Figure 4: A comparative case study of navigational guidance provided by SFT and SFT+(LaF-GRPO) methods.
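
The Figure 2 caption notes that the LaF reward differentiates outputs when the Format and Text Generation rewards are identical. The sketch below illustrates why that matters under the standard group-relative advantage used in GRPO, where each sampled output's reward is normalized by the mean and standard deviation of its group. It is a minimal illustration, not the authors' implementation: the equal reward weights, the binary follower score, the use of the population standard deviation, and all function and variable names are assumptions made for this example.

```python
from statistics import mean, pstdev


def total_rewards(format_r, text_r, laf_r, weights=(1.0, 1.0, 1.0)):
    """Combine per-candidate rewards as a weighted sum (weights are assumed)."""
    w_f, w_t, w_l = weights
    return [w_f * f + w_t * t + w_l * l for f, t, l in zip(format_r, text_r, laf_r)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate instructions sampled for the same input (hypothetical values).
format_r = [1.0, 1.0, 1.0, 1.0]  # all candidates follow the output format
text_r = [0.5, 0.5, 0.5, 0.5]    # identical text-similarity scores
# Hypothetical follower score: 1.0 if the simulated VI user's interpreted
# action matches the intended action, else 0.0.
laf_r = [1.0, 0.0, 1.0, 0.0]

# Without a follower reward, every candidate ties and the advantages vanish.
without_laf = group_relative_advantages(
    total_rewards(format_r, text_r, [0.0] * 4)
)
# With the follower reward, the tie is broken and the signal is non-zero.
with_laf = group_relative_advantages(total_rewards(format_r, text_r, laf_r))

print(without_laf)
print(with_laf)
```

In this toy group, the advantages are all zero when only the tied rewards are used, so those samples would contribute no learning signal. Adding the follower score gives positive advantages to the candidates the simulated follower interprets correctly and negative advantages to the rest.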