E-SocialNav: Efficient Socially Compliant Navigation with Language Models

Ling Xiao; Daeun Song; Xuesu Xiao; Toshihiko Yamasaki

Abstract

Language models (LMs) are increasingly applied to robotic navigation; however, existing benchmarks primarily emphasize navigation success rates while paying limited attention to social compliance. Moreover, relying on large-scale LMs can raise efficiency concerns, as their heavy computational overhead leads to slower response times and higher energy consumption, making them impractical for real-time deployment on resource-constrained robotic platforms. In this work, we evaluate the social compliance of GPT-4o and Claude in robotic navigation and propose E-SocialNav, an efficient LM designed for socially compliant navigation. Despite being trained on a relatively small dataset, E-SocialNav consistently outperforms zero-shot baselines in generating socially compliant behaviors. By employing a two-stage training pipeline consisting of supervised fine-tuning followed by direct preference optimization, E-SocialNav achieves strong performance in both text-level semantic similarity to human annotations and action accuracy. The source code is available at https://github.com/Dr-LingXiao/ESocialNav.
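The second stage of the pipeline described above, direct preference optimization, can be illustrated with the standard DPO loss for a single chosen/rejected pair. This is a minimal sketch, not the authors' implementation: the function names, the scalar sequence-level log-probabilities, and the value `beta=0.1` are assumptions made for illustration, and the exact objective and hyperparameters used for E-SocialNav may differ.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the trainable policy or the frozen
    reference model. beta is a hypothetical value, not taken from the paper.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# With the policy equal to the reference, the margin is zero.
loss_at_init = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
loss_after_update = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

In this form the loss equals log 2 when the policy matches the reference, and it falls as the policy assigns relatively more probability to the human-annotated response than to the modified, rejected one.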

Paper Structure (9 sections, 4 equations, 5 figures, 2 tables)

Introduction
Related Work
Social Robot Navigation
Small Language Models
Methods
Experiments
Experimental Settings
Experimental Results
Conclusions

Figures (5)

Figure 1: The detailed structure of E-SocialNav. E-SocialNav is trained in two phases: SFT on multi-turn dialogues, followed by DPO on single-turn pairs. During SFT, only the projector is updated; during DPO, only the LoRA adapter is updated.
Figure 2: Visualization of constructed DPO training pairs. The chosen response is annotated by humans, whereas the rejected response is generated by modifying certain facts in the chosen response.
Figure 3: Visualizations: E-SocialNav accurately captures social-compliance cues from the image.
Figure 4: Visualization of failure cases. Gt: Ground truth.
