LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies

I Made Aswin Nahrendra; Seunghyun Lee; Dongkyu Lee; Hyun Myung

TL;DR

LocoVLM addresses the limitation of geometry-centric legged locomotion by grounding vision-language semantics into executable locomotion skills. It combines an offline data-distillation pipeline from an LLM with a vision-language grounding module and a style-conditioned controller, enabling real-time adaptation to high-level instructions without online LLM queries. The approach leverages mixed-precision retrieval and text-as-image representations to achieve robust, instruction-grounded motion with strong zero-shot generalization across embodiments. Experimental results show improved gait tracking, scalable data generation, and semantically aware behavior in diverse terrains, highlighting practical impact for interactive, semantically guided legged robotics. The work establishes a scalable framework for integrating foundation models into real-time locomotion, with potential extensions to multimodal scene understanding and tighter navigation-semantic coupling.

Abstract

Recent advances in legged locomotion learning are still dominated by geometric representations of the environment, which limits the robot's ability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into legged locomotion adaptation. Specifically, our method uses a pre-trained large language model to synthesize an instruction-grounded skill database tailored for legged robots. A pre-trained vision-language model extracts high-level environmental semantics and grounds them within the skill database, enabling real-time skill advisories for the robot. To facilitate versatile skill control, we train a style-conditioned policy capable of generating diverse and robust locomotion skills with high fidelity to specified styles. To the best of our knowledge, this is the first work to demonstrate real-time adaptation of legged locomotion using high-level reasoning from environmental semantics and instructions, achieving instruction-following accuracy of up to 87% without online queries to cloud-hosted foundation models.
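The grounding step described in the abstract, where a query is matched against a pre-built skill database, can be illustrated with a nearest-neighbor lookup. The sketch below is not the authors' implementation: it assumes cosine similarity as the distance measure, uses made-up three-dimensional embeddings in place of real VLM outputs, and invents the motion descriptor fields (`gait`, `frequency_hz`) purely for illustration. It also omits the mixed-precision retrieval mechanism, whose details are not given here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_motion_descriptor(query_embedding, skill_database):
    """Return the motion descriptor whose instruction embedding is closest to the query."""
    best_entry = max(
        skill_database,
        key=lambda entry: cosine_similarity(query_embedding, entry["embedding"]),
    )
    return best_entry["motion_descriptor"]

# Hypothetical toy database: embeddings and descriptor fields are assumptions,
# standing in for LLM-generated entries encoded offline by a VLM.
SKILL_DATABASE = [
    {"instruction": "walk steadily",
     "embedding": [0.9, 0.1, 0.0],
     "motion_descriptor": {"gait": "trot", "frequency_hz": 2.0}},
    {"instruction": "leap forward like a rabbit",
     "embedding": [0.0, 0.9, 0.1],
     "motion_descriptor": {"gait": "bound", "frequency_hz": 3.0}},
    {"instruction": "jump in place",
     "embedding": [0.1, 0.0, 0.9],
     "motion_descriptor": {"gait": "pronk", "frequency_hz": 2.5}},
]

# A query embedding would come from the VLM at run time; here it is hand-written.
query = [0.8, 0.2, 0.1]
descriptor = retrieve_motion_descriptor(query, SKILL_DATABASE)
```

Because the database is built offline, a lookup of this kind needs only one encoder pass and a similarity search at run time, which is consistent with the claim that no online foundation-model query is required.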

Paper Structure (40 sections, 8 equations, 12 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Locomotion Skill Control
LLM as Robot Policies
LLM as a Data Generator
Methodology
Versatile Quadrupedal Locomotion
Style-Conditioned Locomotion Policy
Compliant Contact Tracking
Scaling Up Motion Description Data
Instruction Description Generation
Instruction-Grounded Motion Description
Prompted Reasoning for Motion Description Generation
Vision-Language Model as a Motion Advisor
Mixed-Precision Retrieval
...and 25 more sections

Figures (12)

Figure 1: LocoVLM receives vision-language instructions as its input and hierarchically grounds them into versatile locomotion skills. (a) An LLM is used to scale up motion descriptor data generation and store it into a skill database. (b) During inference, a VLM retrieves the most relevant motion descriptor from the database using the proposed mixed-precision retrieval mechanism to give a style reference to the locomotion controller. (c) Finally, a pre-trained style-conditioned locomotion controller executes the robot’s motion to realize the given vision-language instructions.
Figure 2: Gait phase encoding for a cycle duration of $T\!=\!1$. The gait phase encoding vector is a two-dimensional representation of the current phase in the gait cycle.
Figure 3: Offline skill database generation pipeline. The LLM first generates instructions, which are categorized into mimicking behaviors, scene responses, and direct instructions. These instructions are then passed to a meta-prompt to generate the content of the skill database.
Figure 4: Skill database retrieval process. The instruction query is encoded by a VLM into a learned embedding space. A VLM is used instead of a sentence encoder to enable multimodal retrieval from text or image inputs. The closest instruction embedding in the database is retrieved to obtain the corresponding motion descriptor.
Figure 5: Gait tracking performance of the locomotion policy. The top row shows the foot contact states, while the bottom row displays snapshots of the robot. The robot accurately follows the desired gait patterns for various gaits, namely: (a) pronk, (b) trot, (c) pace, (d) bound, and (e) rotary gallop.
...and 7 more figures
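The caption of Figure 2 describes the gait phase encoding only as a two-dimensional representation of the current phase in the gait cycle. A common way to build such a vector is to map the phase onto the unit circle, which is what the sketch below does. The sine and cosine form, the function name `gait_phase_encoding`, and the parameter `cycle_duration` are assumptions made for illustration, not details confirmed by the paper.

```python
import math

def gait_phase_encoding(t, cycle_duration=1.0):
    """Encode time t as a point on the unit circle (assumed sin/cos form).

    The phase wraps every `cycle_duration` seconds, so the encoding is
    continuous across cycle boundaries, unlike a raw phase in [0, 1).
    """
    phase = (t % cycle_duration) / cycle_duration  # normalized phase in [0, 1)
    angle = 2.0 * math.pi * phase
    return (math.sin(angle), math.cos(angle))

# Sample the encoding at quarter-cycle steps for T = 1, as in Figure 2.
samples = [gait_phase_encoding(step / 4.0) for step in range(5)]
```

A unit-circle encoding avoids the discontinuity a scalar phase has when it wraps from 1 back to 0, which is one plausible reason to prefer two dimensions over one.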
