Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments

Mohamed Elnoor, Kasun Weerakoon, Gershom Seneviratne, Jing Liang, Vignesh Rajagopal, Dinesh Manocha

TL;DR

Vi-LAD addresses the challenge of socially aware navigation in dynamic human environments by distilling reasoning from large Vision-Language Models into a lightweight transformer. It combines attention-level knowledge transfer from a vision-action backbone (VANP) and a large VLM, producing enhanced attention maps that guide a Model Predictive Control-based planner for real-time, socially compliant motion. The method introduces an attention-consistency loss and LoRA-based fine-tuning to align attentions while preserving pretrained navigation priors, enabling efficient deployment without on-device VLM queries. Real-world experiments on a Husky robot show significant improvements in success rate and human-likeness of trajectories, with inference rates suitable for real-time operation. This work demonstrates that compact models can inherit rich social reasoning from large models, enabling scalable, socially aware robotics in dynamic environments.
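The LoRA-based fine-tuning mentioned above can be illustrated with a minimal sketch of the general low-rank adaptation idea. This is not the authors' implementation: the layer sizes, rank, scaling factor `alpha`, and the helper names `matmul` and `lora_weight` are all hypothetical. The point it shows is that a zero-initialized `B` leaves the frozen pretrained weight unchanged at the start of fine-tuning, which is one way pretrained priors can be preserved.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return the effective weight W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical sizes, chosen only for illustration.
random.seed(0)
d_out, d_in, rank, alpha = 4, 6, 2, 4.0

# Frozen pretrained weight (random stand-in values).
w_frozen = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Trainable low-rank factors: A is small random, B starts at zero.
lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]

w_start = lora_weight(w_frozen, lora_a, lora_b, alpha, rank)

trainable = rank * (d_in + d_out)  # parameters in A and B
full = d_in * d_out                # parameters in the frozen weight
```

Only `lora_a` and `lora_b` would be updated during fine-tuning in this sketch; the parameter saving becomes substantial once the layer dimensions are much larger than the rank.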

Abstract

We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a lightweight transformer model for real-time robotic navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, Vi-LAD performs knowledge distillation and fine-tuning at the intermediate layer representation level (i.e., attention maps) by leveraging the backbone of a pre-trained vision-action model. These attention maps highlight key navigational regions in a given scene, which serve as implicit guidance for socially aware motion planning. Vi-LAD fine-tunes a transformer-based model using intermediate attention maps extracted from the pre-trained vision-action model, combined with attention-like semantic maps constructed from a large VLM. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then utilized as a traversability costmap within a socially aware model predictive controller (MPC) for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, demonstrating significant improvements over state-of-the-art (SOTA) navigation methods. Our results show improvements of 14.2% to 50% in success rate, which highlights the effectiveness of Vi-LAD in enabling socially compliant and efficient robot navigation.
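To make the idea of an attention-level distillation loss concrete, here is a minimal sketch of one plausible form, assuming a structural-similarity (SSIM) term against each teacher map. It is a simplification and not the paper's exact loss: SSIM is computed over a single global window rather than sliding local windows, maps are flattened lists with values in [0, 1], and the mixing weight `lam` and the function names `ssim_global` and `distill_loss` are hypothetical.

```python
def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized maps (flattened lists)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard SSIM stabilizing constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def distill_loss(student, vanp_attn, vlm_map, lam=0.5):
    """Weighted sum of (1 - SSIM) against each teacher map.

    `lam` is an assumed mixing weight, not a value from the paper.
    """
    loss_vanp = 1.0 - ssim_global(student, vanp_attn)
    loss_vlm = 1.0 - ssim_global(student, vlm_map)
    return lam * loss_vanp + (1.0 - lam) * loss_vlm


# Toy 2x2 maps, flattened; the values are made up for illustration.
student = [0.1, 0.9, 0.8, 0.2]
vanp_attn = [0.0, 1.0, 1.0, 0.0]
vlm_map = [0.0, 1.0, 0.0, 0.0]

loss = distill_loss(student, vanp_attn, vlm_map)
```

The loss approaches zero when the student map matches both teacher maps, and it grows as the student's structure diverges from either one. In practice a windowed SSIM from an established library would likely be preferable to this global version.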

Paper Structure

This paper contains 22 sections, 8 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Robot navigation using Vi-LAD and baseline methods in a social navigation scenario. Vi-LAD distills social navigation knowledge from a pretrained vision-action model, VANP [10802451], and a large VLM by leveraging attention maps to generate an enhanced attention representation for socially compliant navigation. This improved attention representation (bottom left) enables better understanding of human intentions, allowing the robot to anticipate movement patterns and avoid potential disruptions to pedestrians.
  • Figure 2: System architecture of Vi-LAD. Our method distills social navigation knowledge from a pretrained vision-action model, VANP [10802451], and a large Vision-Language Model (VLM) by leveraging attention maps rather than performing end-to-end distillation or fine-tuning. These attention maps highlight critical regions for socially compliant navigation and are extracted from intermediate layers of the image encoders. Vi-LAD employs Structural Similarity Index Loss (SSIL) to effectively distill attention information from both VANP’s intermediate attention layers and the predictive attention maps of a large VLM, ensuring enhanced perceptual alignment for navigation.
  • Figure 3: Attention maps generated by Vi-LAD in different social scenarios, obtained by distilling attention knowledge from both the pretrained vision-action model VANP [10802451] and the large VLM, compared against the attention maps from VANP and the large VLM themselves. The Jet color map is applied to highlight attended regions, with red indicating the most highly attended areas. Vi-LAD demonstrates improved attention over both the pretrained model and the large VLM. By leveraging combined knowledge through attention distillation, Vi-LAD effectively corrects missed attention from both sources. This leads to enhanced focus on critical objects and regions within a scene, ensuring a more contextually aware and socially informed attention mechanism.
  • Figure 4: Robot trajectories in complex social navigation scenarios where the directional intent of agents needs to be taken into account during planning. Our method identifies the dynamic motion and navigational intent of agents in the environment from the distilled attention maps, which are then used to plan in a more socially compliant manner without disrupting agent motion. For example, in scenario 1, DWA and CoNVOI fail to anticipate motion, while in scenario 2, VANP and DWA exhibit the same limitation.
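
The captions and abstract describe using the distilled attention map as a traversability costmap for planning. The sketch below shows that idea in its simplest form, assuming a small grid costmap and a fixed set of candidate paths. It is not the paper's MPC formulation: the cost weights `w_social` and `w_goal`, the grid values, the candidate paths, and the function names are all hypothetical.

```python
import math


def trajectory_cost(costmap, cells, goal, w_social=1.0, w_goal=0.1):
    """Sum costmap values along a path and add a goal-distance penalty."""
    social = sum(costmap[r][c] for r, c in cells)
    end_r, end_c = cells[-1]
    goal_dist = math.hypot(goal[0] - end_r, goal[1] - end_c)
    return w_social * social + w_goal * goal_dist


def pick_trajectory(costmap, candidates, goal):
    """Return the name of the lowest-cost candidate path."""
    return min(candidates,
               key=lambda name: trajectory_cost(costmap, candidates[name], goal))


# Toy attention-derived costmap: high values mark cells a pedestrian is
# expected to occupy. Row 3 is nearest the robot, row 0 is farthest.
costmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Candidate paths as sequences of (row, col) grid cells.
candidates = {
    "straight": [(3, 1), (2, 1), (1, 1), (0, 1)],
    "detour": [(3, 1), (3, 0), (2, 0), (1, 0), (0, 0), (0, 1)],
}
goal = (0, 1)

best = pick_trajectory(costmap, candidates, goal)
```

Here the straight path crosses the highly attended cells while the detour avoids them, so the detour is selected even though it is longer. A real controller would also account for robot dynamics, control effort, and a receding planning horizon.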