From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection

Zilin Fang, Anxing Xiao, David Hsu, Gim Hee Lee

TL;DR

The paper proposes a social robot navigation framework that combines geometric path planning with context-aware social reasoning from a task-specific vision-language model (VLM). It samples geometrically feasible candidate paths and selects among them with a fine-tuned VLM, which distills social reasoning from large foundation models into a compact model (Qwen-2.5 7B) suited to real-time path selection. Selection runs in a receding-horizon loop, and the chosen path is fed back to a local ORCA-based controller. Experiments on a Boston Dynamics Spot platform across four social scenarios report the best overall performance among multiple baselines: collision-free trajectories with no social-zone intrusions, the lowest personal-space violation duration, and the least pedestrian-facing time. The work indicates that grounding social norms in a VLM, combined with motion prediction and anchor-based planning, yields robust social navigation in diverse human-centered contexts.

Abstract

Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning. This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths, informed by contextually grounded social expectations, selecting a socially optimized path for the controller. This task-specific VLM distills social reasoning from large foundation models into a smaller and efficient model, allowing the framework to perform real-time adaptation in diverse human-robot interaction contexts. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions. Project page: https://path-etiquette.github.io
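
The two-stage idea in the abstract (filter candidates geometrically, then choose among the survivors by a social criterion) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Human`, `is_collision_free`, `min_clearance`, and `select_path` are hypothetical, the 0.5 m clearance radius is an assumed value, and the `social_score` callable is only a placeholder for the fine-tuned VLM's judgment.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Path = List[Point]


@dataclass
class Human:
    x: float
    y: float
    radius: float = 0.5  # assumed clearance radius in metres (illustrative only)


def is_collision_free(path: Path, humans: Sequence[Human]) -> bool:
    """Geometric check: every waypoint lies outside each human's clearance disc."""
    for px, py in path:
        for h in humans:
            if math.hypot(px - h.x, py - h.y) < h.radius:
                return False
    return True


def min_clearance(path: Path, humans: Sequence[Human]) -> float:
    """Smallest waypoint-to-human distance; infinity when nobody is around."""
    return min(
        (math.hypot(px - h.x, py - h.y) for px, py in path for h in humans),
        default=float("inf"),
    )


def select_path(
    candidates: Sequence[Path],
    humans: Sequence[Human],
    social_score: Callable[[Path], float],
) -> Optional[Path]:
    """Keep geometrically feasible candidates, then pick the best-scoring one.

    `social_score` stands in for the VLM-based evaluation described in the
    abstract; any callable that maps a path to a number can be plugged in.
    """
    feasible = [p for p in candidates if is_collision_free(p, humans)]
    if not feasible:
        return None  # caller decides what to do (e.g. stop or re-sample)
    return max(feasible, key=social_score)
```

In a receding-horizon setting, `select_path` would be called again each planning cycle with freshly sampled candidates and updated human states. Here a simple clearance heuristic such as `lambda p: min_clearance(p, humans)` can serve as the scorer for testing, whereas the paper's system uses the VLM for that step.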

Paper Structure (24 sections, 1 equation, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Illustration of robot navigation in a scenario with three geometrically feasible sampled paths, where the robot should reason about social conventions to select the most appropriate path.
  • Figure 2: System overview. Geometric constraints are extracted from human motion and costmap modules using sensor data. Collision-free path candidates are sampled, projected into the image, and evaluated by a fine-tuned VLM. The selection is fed back as reference to retrieve a path for the local controller.
  • Figure 3: Human Motion Extraction. The module detects and tracks humans using images, then fuses depth information from LiDAR point clouds with ego-pose from odometry to estimate human states in global coordinates.
  • Figure 4: The illustration of Prediction-Fused Costmap Generation.
  • Figure 5: Path Planning. Detoured yet collision-free path candidates are mainly obtained through the use of anchors.
  • ...and 6 more figures
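
The coordinate step mentioned in the Figure 3 caption (combining a human position measured relative to the robot with the robot's ego-pose to obtain global coordinates) can be illustrated with a planar rigid transform. This is a simplified 2D sketch under stated assumptions, not the paper's pipeline: the function name `robot_to_global` is hypothetical, the pose is assumed to be `(x, y, yaw)` with yaw in radians, and the robot frame is assumed to have x pointing forward and y pointing left.

```python
import math
from typing import Tuple


def robot_to_global(
    px: float,
    py: float,
    robot_x: float,
    robot_y: float,
    robot_yaw: float,
) -> Tuple[float, float]:
    """Map a point from the robot frame into the global frame.

    (px, py) is the human position in the robot frame (assumed x forward,
    y left). (robot_x, robot_y, robot_yaw) is the ego-pose from odometry,
    with yaw in radians measured counter-clockwise from the global x axis.
    """
    c, s = math.cos(robot_yaw), math.sin(robot_yaw)
    # Rotate by the robot heading, then translate by the robot position.
    gx = robot_x + c * px - s * py
    gy = robot_y + s * px + c * py
    return gx, gy
```

For example, with the robot at (1, 2) facing along the global y axis (yaw of π/2), a person detected 3 m straight ahead maps to approximately (1, 5) in global coordinates.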