Planning with Reasoning using Vision Language World Model

Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung

TL;DR

VLWM addresses the need for high-level, semantically grounded world models by learning language-based abstractions of future states from a large corpus of natural videos. It combines Tree of Captions compression, iterative Self-Refine plan extraction, and a dual-mode planner with a self-supervised critic to enable both reactive and reasoning-based planning. The approach achieves state-of-the-art results on Visual Planning for Assistance benchmarks, outperforms baselines on RoboVQA and WorldPrediction-PP, and produces plans that PlannerArena evaluations favor. By grounding world modeling in language and leveraging offline data, VLWM provides a scalable bridge between perception, reasoning, and long-horizon decision making, with open-source data and models to facilitate further research.

Abstract

Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented by a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by 27% over system-1. The VLWM models also outperform strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
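
The system-2 procedure described in the abstract (propose candidate plans, roll out their predicted states, score them with a critic, keep the cheapest) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than the paper's implementation: `system2_plan`, `toy_rollout`, `toy_critic_cost`, and the `EFFECTS` table are illustrative, and the toy cost simply counts goal facts missing from the final predicted state instead of using a learned semantic distance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

State = FrozenSet[str]  # toy world state: a set of plain-language facts


@dataclass(frozen=True)
class ScoredPlan:
    actions: Tuple[str, ...]
    states: Tuple[State, ...]  # predicted state after each action
    cost: float


def system2_plan(
    candidates: Sequence[Sequence[str]],
    rollout: Callable[[Sequence[str]], List[State]],
    critic_cost: Callable[[State, List[State]], float],
    goal: State,
) -> ScoredPlan:
    """Roll out every candidate plan, score it, and return the lowest-cost one."""
    if not candidates:
        raise ValueError("need at least one candidate plan")
    scored = []
    for actions in candidates:
        states = rollout(actions)
        scored.append(
            ScoredPlan(tuple(actions), tuple(states), critic_cost(goal, states))
        )
    return min(scored, key=lambda plan: plan.cost)


# Hypothetical dynamics table standing in for the learned world model.
EFFECTS: Dict[str, str] = {
    "crack eggs": "eggs in bowl",
    "whisk eggs": "eggs beaten",
    "heat pan": "pan hot",
    "pour eggs": "eggs cooking",
    "wash car": "car clean",
}


def toy_rollout(actions: Sequence[str]) -> List[State]:
    """Each known action adds one fact to the accumulated world state."""
    state: State = frozenset()
    states: List[State] = []
    for action in actions:
        effect = EFFECTS.get(action)
        if effect is not None:
            state = state | {effect}
        states.append(state)
    return states


def toy_critic_cost(goal: State, states: List[State]) -> float:
    """Missing goal facts in the final state, plus a small per-step penalty."""
    final = states[-1] if states else frozenset()
    return len(goal - final) + 0.1 * len(states)


GOAL: State = frozenset({"eggs beaten", "pan hot", "eggs cooking"})
CANDIDATES = [
    ["crack eggs", "heat pan"],
    ["crack eggs", "whisk eggs", "heat pan", "pour eggs"],
    ["crack eggs", "whisk eggs", "heat pan", "pour eggs", "wash car"],
]

if __name__ == "__main__":
    best = system2_plan(CANDIDATES, toy_rollout, toy_critic_cost, GOAL)
    print(best.actions, best.cost)
```

In this toy setting the incomplete plan is penalized for missing goal facts and the plan with an irrelevant extra step is penalized for length, so the four-step plan is selected. A system-1 decoder would instead emit a single plan directly, without the roll-out and scoring loop.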

Paper Structure

This paper contains 35 sections, 2 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Example of a VLWM action-state trajectory given a video observation and a goal. VLWM can either generate a plan using one roll-out (system-1), or search over multiple actions by inferring the new world states and minimizing a given cost function (system-2).
  • Figure 2: Overview of VLWM. (a) VLWM is a JEPA-style world model that predicts abstract representations of future world states, instead of generating noisy and high-volume raw observations. (b) Given video contexts, VLWM's prediction target is a structured textual representation of the unobserved future. It includes the goal and interleaved actions ($A$) and world state changes ($\Delta S$), all extracted automatically. (c) VLWM can infer possible goals from the context, and interpret them with the current initial state and the expected final state. It supports both fast reactive system-1 plan generation and reflective system-2 reasoning based on cost minimization.
  • Figure 3: System-2 planning of VLWM. (a) The critic is trained in a self-supervised manner to assign lower cost to valid progress and higher cost to trajectories with irrelevant distractor steps added or with shuffled steps. (b) VLWM generates candidate action sequences and simulates their future state transitions. A critic evaluates the resulting state trajectories given the goal, and the planner selects the lowest-cost plan.
  • Figure 4: Illustration of PlannerArena annotation interface.
  • Figure 5: Cost curves estimated by different critic models. Each plot visualizes 3k cost curves on goal achievement detection trajectories, where each trajectory is composed of a reference gold plan (0%-100%) and distractor steps (100%-200%). Red dots ($\boldsymbol{\cdot}$) mark cost-minimizing steps (detected goal achievement points). VLWM-Critic accurately detects goal completion around 100% plan length, while baselines show suboptimal or noisy behavior.
  • ...and 5 more figures
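
The captions of Figures 3(a) and 5 describe two ideas that a short Python sketch can make concrete: building self-supervised preference pairs for a critic (valid progress versus distractor or shuffled trajectories), and detecting goal achievement as the cost-minimizing step along a trajectory of gold steps followed by distractors. This is a sketch under stated assumptions, not the paper's code: `make_critic_pairs`, `detect_goal_achievement`, and `toy_cost` are hypothetical names, and the hand-written `toy_cost` stands in for a trained critic. The loss used to train on such pairs is not specified in this section.

```python
import random
from typing import Callable, List, Sequence, Tuple

Trajectory = List[str]


def make_critic_pairs(
    gold: Sequence[str], distractors: Sequence[str], rng: random.Random
) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (preferred, dispreferred) pairs; the preferred side should cost less."""
    gold = list(gold)
    pairs: List[Tuple[Trajectory, Trajectory]] = []
    # Valid progress: a longer prefix of the gold plan beats a shorter one.
    for k in range(1, len(gold)):
        pairs.append((gold[: k + 1], gold[:k]))
    # Distractor: appending an irrelevant step should raise the cost.
    pairs.append((gold, gold + [rng.choice(list(distractors))]))
    # Shuffle: a reordered plan should cost more than the original order.
    if len(set(gold)) > 1:
        shuffled = gold[:]
        while shuffled == gold:
            rng.shuffle(shuffled)
        pairs.append((gold, shuffled))
    return pairs


def detect_goal_achievement(
    steps: Sequence[str], cost_fn: Callable[[Sequence[str]], float]
) -> Tuple[int, List[float]]:
    """Score every prefix and return the length of the cheapest one."""
    costs = [cost_fn(steps[:k]) for k in range(1, len(steps) + 1)]
    best_k = min(range(len(costs)), key=costs.__getitem__) + 1
    return best_k, costs


GOLD = ["crack eggs", "whisk eggs", "pour eggs into pan"]
DISTRACTORS = ["wash car", "water plants", "fold laundry"]


def toy_cost(prefix: Sequence[str]) -> float:
    """Hypothetical critic: missing gold steps plus irrelevant steps taken."""
    done = set(prefix)
    missing = sum(1 for step in GOLD if step not in done)
    irrelevant = sum(1 for step in prefix if step not in GOLD)
    return float(missing + irrelevant)


if __name__ == "__main__":
    pairs = make_critic_pairs(GOLD, DISTRACTORS, random.Random(0))
    print(len(pairs), "preference pairs")
    best_k, costs = detect_goal_achievement(GOLD + DISTRACTORS, toy_cost)
    print("cost curve:", costs, "goal detected after step", best_k)
```

With three gold steps followed by three distractors, the toy cost falls while gold steps are added and rises afterwards, so the minimum sits at the end of the gold plan, which corresponds to the 100% mark in Figure 5. Note that `toy_cost` ignores step order, so it would not separate the shuffled pair; a learned critic is expected to handle that case.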