
Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models

Annie S. Chen, Alec M. Lessing, Andy Tang, Govind Chada, Laura Smith, Sergey Levine, Chelsea Finn

TL;DR

This work tackles robust, autonomous legged locomotion in unstructured environments by leveraging pre-trained vision-language models (VLMs). The proposed VLM-PC framework combines in-context reasoning over the robot's interaction history with model-predictive-style multi-step planning to generate and execute high-level skill plans through a language-grounded interface, enabling adaptive behavior without task-specific training. Empirical results on a Go1 quadruped across five real-world obstacle courses show that VLM-PC substantially increases success rates and reduces completion times compared to baselines, with further gains when labeled in-context examples are provided. Overall, the approach demonstrates a practical pathway for integrating multimodal foundation models into autonomous robotics, reducing the need for environment-specific engineering or human guidance.

Abstract

Legged robots are physically capable of navigating a diverse variety of environments and overcoming a wide range of obstructions. For example, in a search and rescue mission, a legged robot could climb over debris, crawl through gaps, and navigate out of dead ends. However, the robot's controller needs to respond intelligently to such varied obstacles, and this requires handling unexpected and unusual scenarios successfully. This presents an open challenge to current learning methods, which often struggle with generalization to the long tail of unexpected situations without heavy human supervision. To address this issue, we investigate how to leverage the broad knowledge about the structure of the world and commonsense reasoning capabilities of vision-language models (VLMs) to aid legged robots in handling difficult, ambiguous situations. We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection with VLMs: (1) in-context adaptation over previous robot interactions and (2) planning multiple skills into the future and replanning. We evaluate VLM-PC on several challenging real-world obstacle courses, involving dead ends and climbing and crawling, on a Go1 quadruped robot. Our experiments show that by reasoning over the history of interactions and future plans, VLMs enable the robot to autonomously perceive, navigate, and act in a wide range of complex scenarios that would otherwise require environment-specific engineering or human guidance.

Paper Structure (18 sections, 6 figures, 3 tables)

This paper contains 18 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Vision-Language Model Predictive Control (VLM-PC) enables real-world locomotion adaptation. By leveraging the commonsense reasoning abilities of pre-trained VLMs to adaptively select behaviors, VLM-PC allows legged robots to quickly adjust strategies when encountering a wide range of situations, even backtracking when appropriate. Center: An example trajectory of the robot tasked with finding the red chew toy amid obstacles using VLM-PC: it first crawls under a couch, then backs out when it finds a dead end, turns to walk around the couch, climbs over a sizeable cushion, and finally locates the toy. Bottom left: An overhead view of the trajectory with VLM-PC. Bottom right: An example trajectory of the robot's behavior using a VLM naively, where the robot gets stuck and cannot adapt. Left and right: Visualization of the robot's egocentric POV that is provided to the VLM at different points along the trial, along with excerpts of reasoning with VLM-PC at those points.
  • Figure 2: Vision-Language Model Predictive Control (VLM-PC). Our method uses a pre-trained VLM to provide high-level skill commands for a legged robot to execute. Given the robot's current view and history of interactions, the VLM is first prompted to reason through the robot's current state and progress with the history of commanded skills, and is then prompted to make a new multi-step plan, compare it to the prior plan, and adjust if needed. The robot executes the first skill in the plan, and the VLM is queried again.
  • Figure 3: Deployment Environments. We evaluate VLM-PC on five challenging real-world settings, each of which presents unseen obstacles designed to get the robot stuck and requires commonsense reasoning to solve. For each setting, we give a third-person view of the obstacle course as well as an example path through the course, with three different egocentric views (labeled 1, 2, 3) at different points to show the diversity of scenes the robot encounters from its viewpoint.
  • Figure 4: Main Results Averaged Across Settings. Averaged across all five settings, VLM-PC significantly outperforms Random, No History, and No Multi-Step on mean and median time to complete the task and on success rate, with a success rate roughly 30% higher than the next best method.
  • Figure 5: Typical VLM Interactions. With VLM-PC (Right), the VLM can both analyze the efficacy of previous commands and prepare new, coherent plans to tackle the current obstacle, combining the benefits of multi-step planning (as in No History, Left) and reasoning over history (as in No Multi-Step, Center).
  • ...and 1 more figure
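
The query, plan, execute-first-skill, re-query loop described in the Figure 2 caption can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the skill names in `SKILLS`, the `query_vlm` and `execute` callables, the `PlannerState` container, and the scripted `fake_vlm` stand-in are all hypothetical names introduced here, and no real VLM API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical skill vocabulary; the source only says "high-level skill commands".
SKILLS = ("walk_forward", "crawl", "climb", "back_up", "turn_left", "turn_right")


@dataclass
class Step:
    skill: str      # skill that was commanded at this step
    reasoning: str  # reasoning text the VLM produced at this step


@dataclass
class PlannerState:
    history: List[Step] = field(default_factory=list)    # past interactions
    prior_plan: List[str] = field(default_factory=list)  # unexecuted rest of last plan


# Assumed interface: (image, history, prior_plan) -> (reasoning, new multi-step plan)
VLMFn = Callable[[bytes, Sequence[Step], Sequence[str]], Tuple[str, List[str]]]


def vlm_pc_step(image: bytes, state: PlannerState, query_vlm: VLMFn,
                execute: Callable[[str], None]) -> str:
    """One iteration: query with history and prior plan, execute only the first skill."""
    reasoning, plan = query_vlm(image, state.history, state.prior_plan)
    plan = [s for s in plan if s in SKILLS]  # drop anything outside the vocabulary
    if not plan:
        raise ValueError("VLM returned no valid skill")
    skill = plan[0]
    execute(skill)
    state.history.append(Step(skill, reasoning))
    # Design assumption: keep the remaining steps so the next query can compare against them.
    state.prior_plan = plan[1:]
    return skill


def fake_vlm(image: bytes, history: Sequence[Step],
             prior_plan: Sequence[str]) -> Tuple[str, List[str]]:
    """Scripted stand-in for a VLM, loosely mimicking the dead-end example in Figure 1."""
    if not history:
        return "The gap under the couch looks passable.", ["crawl", "walk_forward"]
    if history[-1].skill == "crawl":
        return "Dead end under the couch; back out.", ["back_up", "turn_left", "walk_forward"]
    return "Prior plan still looks valid.", list(prior_plan) or ["walk_forward"]


def run_demo(num_steps: int = 3) -> Tuple[List[str], PlannerState]:
    executed: List[str] = []
    state = PlannerState()
    for _ in range(num_steps):
        # b"" stands in for the robot's current egocentric camera frame.
        vlm_pc_step(b"", state, fake_vlm, executed.append)
    return executed, state


if __name__ == "__main__":
    skills, _ = run_demo()
    print(skills)
```

Executing only the first skill of each plan and then replanning is what gives the loop its model-predictive flavor: the rest of the plan serves as context for the next query rather than as a commitment.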