Using Vision Language Models as Closed-Loop Symbolic Planners for Robotic Applications: A Control-Theoretic Perspective

Hao Wang, Sathwik Karnik, Bea Lim, Somil Bansal

TL;DR

This work examines Vision Language Models (VLMs) as closed-loop symbolic planners for robotic tasks from a control-theoretic perspective, focusing on replanning frequency (the control horizon $N$) and warm-starting. Controlled experiments across four task environments and three VLMs compare open-loop and closed-loop planning. They show that closed-loop planning improves geometric success and robustness to VLM errors, that warm-starting yields substantial gains, and that horizon effects are generally modest when warm-starting is employed. The study gives practical recommendations for deploying VLM-driven planners in long-horizon manipulation and highlights the importance of the choice of underlying VLM. Limitations include zero-shot prompting, limited prompt engineering, and a restricted set of action primitives, which point to future work on training-data effects and broader planning capabilities.

Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) have been widely used for embodied symbolic planning. Yet, how to effectively use these models for closed-loop symbolic planning remains largely unexplored. Because they operate as black boxes, LLMs and VLMs can produce unpredictable or costly errors, making their use in high-level robotic planning especially challenging. In this work, we investigate how to use VLMs as closed-loop symbolic planners for robotic applications from a control-theoretic perspective. Concretely, we study how the control horizon and warm-starting impact the performance of VLM symbolic planners. We design and conduct controlled experiments to gain insights that are broadly applicable to utilizing VLMs as closed-loop symbolic planners, and we discuss recommendations that can help improve the performance of VLM symbolic planners.

Paper Structure

This paper contains 15 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Hierarchical planning and control architecture
  • Figure 2: Initial (top) and final (bottom) observation images for sampled trajectories in the four environments: $\textsc{Cube-Easy}$, $\textsc{YCB-Easy}$, $\textsc{YCB-Medium}$, and $\textsc{YCB-Hard}$.
  • Figure 3: Goal Achieved Rates and Task Completion Rates for the OL and $\textsc{CL-Full}$ planners. Each bar indicates the Goal Achieved Rate, and the black dotted line indicates the Task Completion Rate.
  • Figure 4: Goal Achieved Rates and Task Completion Rates for three $\textsc{CL-Full}$ planners with different control horizon settings.
  • Figure 5: Goal Achieved Rates and Task Completion Rates of warm-starting and non-warm-starting variants of CL planners with two control horizon settings.

Theorems & Definitions (4)

  • Definition 1: Open-loop Planner
  • Definition 2: Closed-loop Planner
  • Definition 3: Control Horizon
  • Definition 4: Warm-Starting
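
The four notions listed above can be illustrated with a small sketch. It is not the paper's implementation: it assumes that the control horizon $N$ is the number of planned actions executed before the planner is queried again, and that warm-starting means passing the unexecuted remainder of the previous plan back to the planner as a hint. The names `closed_loop_plan`, `toy_planner`, `toy_step`, and `toy_goal` are hypothetical, and `toy_planner` is a stand-in for a VLM query.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[str, ...]  # hypothetical symbolic state: objects placed so far
Plan = List[str]         # a plan is a list of symbolic actions


def closed_loop_plan(
    planner: Callable[[State, Optional[Plan]], Plan],
    step: Callable[[State, str], State],
    is_goal: Callable[[State], bool],
    state: State,
    horizon: int,
    warm_start: bool = True,
    max_queries: int = 20,
) -> Tuple[State, int]:
    """Execute at most `horizon` actions per plan, then replan from the new state.

    Assumption: with `horizon` at least the plan length and `max_queries=1`,
    this reduces to open-loop planning (plan once, execute everything).
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    previous: Optional[Plan] = None
    queries = 0
    while not is_goal(state) and queries < max_queries:
        # Warm-starting (assumed form): hand the leftover plan back as a hint.
        plan = planner(state, previous if warm_start else None)
        queries += 1
        if not plan:
            break
        for action in plan[:horizon]:
            state = step(state, action)
        previous = plan[horizon:]
    return state, queries


GOAL: State = ("a", "b", "c")


def toy_planner(state: State, hint: Optional[Plan]) -> Plan:
    # Stand-in for a VLM: reuse a non-empty hint, otherwise plan from scratch.
    if hint:
        return list(hint)
    return [f"place {name}" for name in GOAL[len(state):]]


def toy_step(state: State, action: str) -> State:
    # Deterministic toy dynamics: "place x" appends x to the state.
    return state + (action.split()[1],)


def toy_goal(state: State) -> bool:
    return state == GOAL


final_state, n_queries = closed_loop_plan(
    toy_planner, toy_step, toy_goal, (), horizon=1
)
```

In this toy setting, `horizon=1` replans after every action, while a horizon equal to the plan length needs only one planner query. A real deployment would replace `toy_planner` with a VLM call on the current observation and `toy_step` with the robot's low-level controller.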