VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models

Daeun Song, Jing Liang, Amirreza Payandeh, Amir Hossain Raj, Xuesu Xiao, Dinesh Manocha

TL;DR

VLM-Social-Nav is a novel Vision-Language Model (VLM) based navigation approach that computes a robot's motion in human-centered environments, yielding more socially compliant navigation in spaces shared with humans.

Abstract

We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term that ensures socially appropriate and effective robot actions generated by the underlying planner. Our overall approach reduces reliance on large training datasets and enhances adaptability in decision-making. In practice, it results in improved socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot. We observe at least 27.38% improvement in the average success rate and 19.05% improvement in the average collision rate in the four social navigation scenarios. Our user study score shows that VLM-Social-Nav generates the most socially compliant navigation behavior.
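
To make the idea of a scoring module that feeds a cost term into the underlying planner more concrete, here is a minimal sketch. It is not the authors' implementation: the `Action` type, the `goal_cost` term, the Euclidean form of `social_cost`, and the `social_weight` parameter are all assumptions chosen for illustration. It only shows how a VLM-preferred action could bias a planner that picks the lowest-cost candidate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    v: float  # linear velocity in m/s
    w: float  # angular velocity in rad/s (positive turns left)


def social_cost(action: Action, preferred: Action) -> float:
    """Assumed social term: distance between a candidate and the VLM-preferred action."""
    return math.hypot(action.v - preferred.v, action.w - preferred.w)


def goal_cost(action: Action, goal_turn_rate: float) -> float:
    """Hypothetical stand-in for the planner's own goal term."""
    return abs(goal_turn_rate - action.w)


def select_action(candidates, preferred, goal_turn_rate, social_weight=1.0):
    """Return the candidate with the lowest combined planner and social cost."""
    def total(a: Action) -> float:
        return goal_cost(a, goal_turn_rate) + social_weight * social_cost(a, preferred)
    return min(candidates, key=total)


candidates = [Action(0.5, 0.0), Action(0.3, -0.4), Action(0.0, 0.0)]
preferred = Action(0.3, -0.4)  # e.g. "slow down and move right" (assumed mapping)

without_social = select_action(candidates, preferred, 0.0, social_weight=0.0)
with_social = select_action(candidates, preferred, 0.0, social_weight=2.0)
```

With `social_weight=0.0` the planner keeps driving straight at full speed; with a positive weight the candidate closest to the VLM guidance wins, which is the sense in which the social term steers the planner without replacing it.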

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 2 tables, and 1 algorithm.

Figures (5)

  • Figure 1: The trajectories of VLM-Social-Nav (red), DWA (blue), and BC (yellow) approaches in the frontal encountering scenario (left) and the intersection scenario (right). The resulting trajectories show that VLM-Social-Nav demonstrates more socially compliant behavior because it is instructed by a prompt.
  • Figure 2: The overall system architecture of VLM-Social-Nav. Our real-world perception (vision) model detects important social entities (e.g., humans, gestures, and doors) in real time and prompts the VLM-based scoring module to compute social cost $\mathcal{C}_\textrm{social}$, which is used to generate socially compliant robot action.
  • Figure 3: An example input image ($\mathcal{I}^t$) and prompt ($\mathcal{P}$) used in VLM-Social-Nav. Parameterized inputs ($\mathbf{a}^t$) are highlighted in blue. Formatted outputs specifying the heading ($\delta_d$) and the speed ($\delta_s$) are highlighted in red. The example input data is one of the frontal approach scenarios from the MuSoHu dataset (Nguyen et al.).
  • Figure 4: Qualitative Results: the robot navigation behaviors with VLM-Social-Nav for four social navigation scenarios: (a) Frontal Approach, (b) Frontal Approach with Gesture, (c) Intersection, and (d) Narrow Doorway. The solid gray arrow shows the participant's path. The solid red arrow shows the robot's path. The red and gray dashed arrows show the robot's and participant's paths, respectively, after a stop motion. A caption on the top left shows the result from the VLM.
  • Figure 5: User Study Average Scores: the per-question average scores for the three methods in each scenario. The results indicate that VLM-Social-Nav earned the highest level of agreement from participants across all questions, highlighting its robust alignment with social norms.
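
The Figure 3 caption mentions formatted VLM outputs that specify a heading ($\delta_d$) and a speed ($\delta_s$). The sketch below illustrates one way such a reply could be turned into a preferred action. The reply wording, the keyword vocabularies, and the numeric values in `HEADING_OFFSETS` and `SPEED_SCALES` are hypothetical and are not taken from the paper.

```python
import re

# Assumed vocabularies and values, for illustration only.
HEADING_OFFSETS = {"left": 0.4, "straight": 0.0, "right": -0.4}  # rad/s
SPEED_SCALES = {"stop": 0.0, "slow down": 0.5, "keep speed": 1.0, "speed up": 1.5}


def parse_guidance(reply: str):
    """Extract (heading keyword, speed keyword) from a reply, or None if either is missing."""
    text = reply.lower()
    heading = next((k for k in HEADING_OFFSETS if re.search(rf"\b{k}\b", text)), None)
    speed = next((k for k in SPEED_SCALES if k in text), None)
    if heading is None or speed is None:
        return None
    return heading, speed


def preferred_action(reply: str, current_v: float):
    """Map a reply to a preferred (linear velocity, angular velocity) pair, or None."""
    parsed = parse_guidance(reply)
    if parsed is None:
        return None
    heading, speed = parsed
    return current_v * SPEED_SCALES[speed], HEADING_OFFSETS[heading]


guided = preferred_action("Move RIGHT and SLOW DOWN.", current_v=0.6)
unparsed = preferred_action("I am not sure.", current_v=0.6)
```

Returning `None` for a reply that does not match the expected format lets the caller fall back to the underlying planner's own costs instead of acting on a guess.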