LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating

Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen, Chenrui Tie, Jiajun Deng, Lin Shao

TL;DR

The paper tackles language-guided social navigation by introducing LISN-Bench, a simulation benchmark built on Arena 3.0, and a fast-slow control framework, Social-Nav-Modulator, that couples a slow VLM-based reasoning module with a fast reactive planner and dynamic costmap. By translating high-level semantic meaning into planner parameters and cost-field adjustments, the approach achieves superior success rates and semantic compliance in dynamic social environments compared to two VLM-based baselines. The findings highlight the practicality of decoupling semantic reasoning from real-time control to enable human-aware robot navigation and point to future work on broader social norms and real-world validation. Overall, this work advances standardized evaluation and robust, instruction-following social navigation in realistic simulations with potential real-world impact.

Abstract

Socially aware navigation is essential for mobile robots that coexist with humans. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions with the task goals and social norms that humans express. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system in which a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, more than 63% higher than that of the most competitive baseline, with most of the improvement observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The proposed Social-Nav-Modulator overview. The slow-loop VLM reasoner takes in visual data and a language instruction, outputting adjustments for the fast-loop Social Force Model (SFM) controller parameters and the value map states, which together generate real-time control commands that comply with the social rules and instructions.
  • Figure 2: The annotations required for LISN task evaluation, shown on the hospital asset in Arena 3.0 [kastner_arena_2024]. Sub-figure a) depicts the semantic region annotations, represented as a list of region masks, each with a semantic ID. Sub-figure b) shows the pedestrian annotations in an episode, where each pedestrian assigned a specific identity also has a pre-defined movement trajectory in the environment. We also add extra mesh models to the original assets so that each identity, such as Doctor, has a corresponding mesh in the simulation.
  • Figure 3: The proposed Social-Nav-Modulator architecture. The slow-loop VLM reasoning agent takes in visual data and a language instruction, outputting adjustments for the fast-loop reactive controller and the social costmap layer, which together generate real-time control commands.
  • Figure 4: Qualitative Evaluation. This figure illustrates how our method improves performance compared with the two baselines. In the upper row, our method successfully tracks the moving doctor, while the two other methods fail because slow VLM inference causes them to lose track of the doctor in the RGB observations. In the lower row, our method attends to ground line markings much better than the two baselines and adheres to the social norm as closely as possible, thanks to extra perception tools.
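
The fast-slow split described in Figures 1 and 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the VLM is replaced by a keyword-based stand-in, the costmap is omitted, and the controller is a simplified social-force-style rule. All names (`ControllerParams`, `slow_reasoner`, `sfm_velocity`, `run`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ControllerParams:
    """Hypothetical knobs that the slow loop is allowed to adjust."""
    max_speed: float = 1.0        # m/s, speed limit of the fast controller
    goal_gain: float = 1.0        # strength of the attraction toward the goal
    repulsion_gain: float = 2.0   # strength of the pedestrian repulsion
    repulsion_range: float = 0.8  # metres, decay length of the repulsion


def slow_reasoner(instruction: str) -> ControllerParams:
    """Stand-in for the VLM: maps an instruction to parameters via keyword rules."""
    text = instruction.lower()
    if "follow" in text:
        return ControllerParams(max_speed=1.2, repulsion_gain=1.0)
    if "careful" in text or "slow" in text:
        return ControllerParams(max_speed=0.5, repulsion_gain=4.0,
                                repulsion_range=1.2)
    return ControllerParams()


def sfm_velocity(pos, goal, pedestrians, p):
    """One fast-loop step: goal attraction plus exponential pedestrian repulsion."""
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    dist = math.hypot(gx, gy)
    vx = p.goal_gain * gx / dist if dist > 1e-9 else 0.0
    vy = p.goal_gain * gy / dist if dist > 1e-9 else 0.0
    for px, py in pedestrians:
        dx, dy = pos[0] - px, pos[1] - py
        d = math.hypot(dx, dy)
        if d < 1e-9:
            continue  # skip a degenerate overlap instead of dividing by zero
        mag = p.repulsion_gain * math.exp(-d / p.repulsion_range)
        vx += mag * dx / d
        vy += mag * dy / d
    speed = math.hypot(vx, vy)
    if speed > p.max_speed:  # clip the command to the current speed limit
        vx, vy = vx * p.max_speed / speed, vy * p.max_speed / speed
    return vx, vy


def run(instruction, steps=200, dt=0.1, slow_every=20):
    """Fast loop runs every step; the slow reasoner refreshes parameters rarely."""
    pos, goal = (0.0, 0.0), (5.0, 0.0)
    pedestrians = [(2.5, 0.3)]  # one static pedestrian, for illustration only
    params = ControllerParams()
    for step in range(steps):
        if step % slow_every == 0:  # slow loop: low-frequency parameter update
            params = slow_reasoner(instruction)
        vx, vy = sfm_velocity(pos, goal, pedestrians, params)
        pos = (pos[0] + vx * dt, pos[1] + vy * dt)
        if math.hypot(goal[0] - pos[0], goal[1] - pos[1]) < 0.2:
            return pos, step + 1
    return pos, steps
```

The point of the structure is that `sfm_velocity` never waits on the reasoner: it keeps issuing commands with the most recent parameters, so a slow update only delays adaptation rather than stalling control. For example, `run("move slowly near patients")` uses the tighter speed limit and stronger repulsion returned by the stand-in reasoner for the whole episode.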