Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments
Mohamed Elnoor, Kasun Weerakoon, Gershom Seneviratne, Jing Liang, Vignesh Rajagopal, Dinesh Manocha
TL;DR
Vi-LAD addresses the challenge of socially aware navigation in dynamic human environments by distilling reasoning from large Vision-Language Models into a lightweight transformer. It combines attention-level knowledge transfer from a vision-action backbone (VANP) and a large VLM, producing enhanced attention maps that guide a Model Predictive Control-based planner for real-time, socially compliant motion. The method introduces an attention-consistency loss and LoRA-based fine-tuning to align the attention maps while preserving pretrained navigation priors, enabling efficient deployment without on-device VLM queries. Real-world experiments on a Husky robot show significant improvements in success rate and human-likeness of trajectories, with inference rates suitable for real-time operation. This work demonstrates that compact models can inherit rich social reasoning from large models, enabling scalable, socially aware robotics in dynamic environments.
Abstract
We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a lightweight transformer model for real-time robotic navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, Vi-LAD performs knowledge distillation and fine-tuning at the intermediate layer representation level (i.e., attention maps) by leveraging the backbone of a pre-trained vision-action model. These attention maps highlight key navigational regions in a given scene, which serve as implicit guidance for socially aware motion planning. Vi-LAD fine-tunes a transformer-based model using intermediate attention maps extracted from the pre-trained vision-action model, combined with attention-like semantic maps constructed from a large VLM. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then used as a traversability costmap within a socially aware model predictive controller (MPC) for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, demonstrating significant improvements over state-of-the-art (SOTA) navigation methods. Our results show success-rate improvements of 14.2% to 50%, which highlights the effectiveness of Vi-LAD in enabling socially compliant and efficient robot navigation.
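
To make the idea of an attention-level distillation loss concrete, the following is a minimal sketch and not the formulation used in Vi-LAD. It assumes that the two supervision sources (the attention map from the pre-trained vision-action backbone and the VLM-derived semantic map) are fused by a simple convex combination, and that the student map is compared against the fused target with a KL divergence. The function names, the weight `alpha`, and the choice of KL are all hypothetical.

```python
import math


def normalize(attn):
    """Scale a 2D map (list of rows) so that its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [[v / total for v in row] for row in attn]


def fuse_targets(backbone_attn, vlm_map, alpha=0.5):
    """Fuse two maps into one target distribution.

    Assumption: a convex combination with weight `alpha` on the backbone map.
    The paper's actual fusion rule is not given in the abstract.
    """
    a = normalize(backbone_attn)
    b = normalize(vlm_map)
    return [
        [alpha * x + (1.0 - alpha) * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def distillation_loss(student_attn, target, eps=1e-8):
    """KL(target || student) over the normalized maps (an assumed loss choice)."""
    student = normalize(student_attn)
    return sum(
        t * math.log((t + eps) / (s + eps))
        for row_t, row_s in zip(target, student)
        for t, s in zip(row_t, row_s)
    )


# Toy 2x2 example: the backbone attends uniformly, while the
# VLM-derived map concentrates on the bottom-right cell.
backbone = [[1.0, 1.0], [1.0, 1.0]]
vlm = [[0.0, 0.0], [0.0, 4.0]]
target = fuse_targets(backbone, vlm, alpha=0.5)

loss_matched = distillation_loss(target, target)    # student equals target
loss_uniform = distillation_loss(backbone, target)  # student ignores the VLM cue
```

In this toy case the loss vanishes when the student reproduces the fused target and is positive when the student stays uniform, which is the behavior any attention-consistency objective of this kind should have. In practice such a loss would be computed on tensors inside the training loop of the student transformer.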
