E-SocialNav: Efficient Socially Compliant Navigation with Language Models

Ling Xiao, Daeun Song, Xuesu Xiao, Toshihiko Yamasaki

Abstract

Language models (LMs) are increasingly applied to robotic navigation; however, existing benchmarks primarily emphasize navigation success rates while paying limited attention to social compliance. Moreover, relying on large-scale LMs can raise efficiency concerns, as their heavy computational overhead leads to slower response times and higher energy consumption, making them impractical for real-time deployment on resource-constrained robotic platforms. In this work, we evaluate the social compliance of GPT-4o and Claude in robotic navigation and propose E-SocialNav, an efficient LM designed for socially compliant navigation. Despite being trained on a relatively small dataset, E-SocialNav consistently outperforms zero-shot baselines in generating socially compliant behaviors. By employing a two-stage training pipeline consisting of supervised fine-tuning followed by direct preference optimization, E-SocialNav achieves strong performance in both text-level semantic similarity to human annotations and action accuracy. The source code is available at https://github.com/Dr-LingXiao/ESocialNav.

Paper Structure (9 sections, 4 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The detailed structure of E-SocialNav. E-SocialNav is trained in two phases: SFT on multi-turn dialogues, followed by DPO on single-turn pairs. During SFT, only the projector is updated; during DPO, only the LoRA adapter is updated.
  • Figure 2: Visualization of constructed DPO training pairs. The chosen response is annotated by humans, whereas the rejected response is generated by modifying certain facts in the chosen response.
  • Figure 3: Visualizations: E-SocialNav accurately captures social-compliance cues from the image.
  • Figure 4: Visualization of failure cases. Gt: Ground truth.
  • Figure :
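
The training recipe described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It rests on several assumptions: the paper's own equations are not reproduced in this section, so the loss below is the standard DPO objective rather than a confirmed copy of the authors' formulation; the module-group names (`vision_encoder`, `projector`, `language_model`, `lora_adapter`), the function names, and the default `beta` are hypothetical and chosen only for illustration. The sketch shows (a) which parameter group is updated in each phase, as stated in the Figure 1 caption, and (b) how a single chosen/rejected pair, as in Figure 2, turns into a scalar loss given sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math

# Hypothetical module-group names. The source only states that the projector
# is updated during SFT and the LoRA adapter during DPO.
ALL_GROUPS = ("vision_encoder", "projector", "language_model", "lora_adapter")
TRAINABLE_BY_PHASE = {
    "sft": {"projector"},      # phase 1: supervised fine-tuning on multi-turn dialogues
    "dpo": {"lora_adapter"},   # phase 2: preference optimization on single-turn pairs
}


def trainable_flags(phase):
    """Return {group: bool}, True if the group is updated in the given phase."""
    active = TRAINABLE_BY_PHASE[phase]
    return {group: group in active for group in ALL_GROUPS}


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    (chosen = human-annotated, rejected = fact-modified) under the
    policy being trained or under the frozen reference model.
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    print(trainable_flags("sft"))
    print(trainable_flags("dpo"))
    # Illustrative log-probabilities only; these are not results from the paper.
    print(dpo_loss(policy_chosen=-5.0, policy_rejected=-9.0,
                   ref_chosen=-6.0, ref_rejected=-6.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability mass toward the human-annotated response relative to the reference. In a real training setup the flags would typically be applied by toggling gradient tracking on the corresponding parameters, which this sketch leaves out.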