LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies
I Made Aswin Nahrendra, Seunghyun Lee, Dongkyu Lee, Hyun Myung
TL;DR
LocoVLM targets a limitation of geometry-centric legged locomotion: such controllers cannot respond to high-level semantics such as human instructions. It grounds vision-language semantics in executable locomotion skills by combining an offline pipeline that distills a skill database from an LLM, a vision-language grounding module, and a style-conditioned controller, which together enable real-time adaptation to high-level instructions without online LLM queries. The approach uses mixed-precision retrieval and text-as-image representations to achieve robust, instruction-grounded motion with strong zero-shot generalization across embodiments. Experiments show improved gait tracking, scalable data generation, and semantically aware behavior on diverse terrains, which points to practical value for interactive, semantically guided legged robots. The work offers a scalable framework for integrating foundation models into real-time locomotion, with potential extensions to multimodal scene understanding and tighter coupling between navigation and semantics.
Abstract
Recent advances in legged locomotion learning are still dominated by geometric representations of the environment, which limits the robot's ability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into legged locomotion adaptation. Specifically, our method uses a pre-trained large language model to synthesize an instruction-grounded skill database tailored for legged robots. A pre-trained vision-language model extracts high-level environmental semantics and grounds them in the skill database, enabling real-time skill advisories for the robot. To facilitate versatile skill control, we train a style-conditioned policy capable of generating diverse and robust locomotion skills with high fidelity to specified styles. To the best of our knowledge, this is the first work to demonstrate real-time adaptation of legged locomotion using high-level reasoning from environmental semantics and instructions, with instruction-following accuracy of up to 87% and without online queries to cloud-hosted foundation models.
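The grounding step described in the abstract, matching a semantic description or instruction against a skill database built offline, can be pictured as nearest-neighbor retrieval. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The database entries, the parameter names (`gait`, `frequency_hz`, `body_height_m`), and the functions `embed`, `cosine`, and `retrieve_skill` are all hypothetical. A toy bag-of-words embedding stands in for the pre-trained vision-language encoder.

```python
import math
import re
from collections import Counter

# Hypothetical skill database. In the paper an LLM synthesizes the database
# offline; these entries and parameter names are invented for illustration.
SKILL_DB = [
    ("walk slowly and carefully on slippery ice",
     {"gait": "walk", "frequency_hz": 1.5, "body_height_m": 0.25}),
    ("trot quickly across open flat ground",
     {"gait": "trot", "frequency_hz": 3.0, "body_height_m": 0.30}),
    ("crouch low to pass under an obstacle",
     {"gait": "walk", "frequency_hz": 2.0, "body_height_m": 0.18}),
]


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_skill(query):
    """Return the style parameters of the best-matching entry and its score."""
    q = embed(query)
    scores = [cosine(q, embed(text)) for text, _ in SKILL_DB]
    best = max(range(len(SKILL_DB)), key=scores.__getitem__)
    return SKILL_DB[best][1], scores[best]


if __name__ == "__main__":
    params, score = retrieve_skill("The floor is slippery ice, walk carefully")
    print(params, round(score, 3))
```

In a full system the retrieved parameters would be passed as the style command to the style-conditioned policy. Because the database is built ahead of time, the lookup runs locally and needs no online query to a foundation model.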
