ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

Zhuoyang Zhang, Shang Yang, Qinghao Hu, Luke J. Huang, James Hou, Yufei Sun, Yao Lu, Song Han

TL;DR

ForeAct tackles open-world robotic manipulation by introducing a visual foresight planner that guides VLA models with imagined future observations and subtask descriptions. Its generator produces high-resolution future predictions efficiently, and a vision-language model (VLM) reasons about subtasks, enabling robust, closed-loop control. The method, pretrained on over 1 million subtasks and evaluated on 11 real-world tasks, delivers substantial gains over strong baselines and demonstrates strong out-of-distribution (OOD) generalization and data efficiency. The approach is compatible with existing VLA models via simple visual-input augmentation, making it practical for real-world deployment.

Abstract

Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image generation module that predicts a high-quality 640$\times$480 future observation from the current visual input and language instruction within only 0.33s on an H100 GPU, together with a vision-language model that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on over 1 million multi-task, cross-embodiment episodes, enabling it to learn robust embodied dynamics. We evaluate our framework on a benchmark that consists of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the $π_0$ baseline (46.5%) and a +30.3% absolute improvement over $π_0$ augmented with textual subtask guidance (57.1%).
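
The closed-loop interaction described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Observation`, `run_episode`, `plan_subtask`, `generate_foresight`, `vla_act`, `execute`, and `is_done` are hypothetical names standing in for the VLM planner, the foresight image generator, the VLA policy, and the robot interface, and the way the planner signals completion is assumed. The one point taken from the text is that the imagined future image is handed to the VLA as an extra visual input, so the policy itself needs no architectural change.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # placeholder image type (rows of pixel values)


@dataclass
class Observation:
    head: Image            # head-camera view, used by the planner and the generator
    cameras: List[Image]   # all camera views passed to the VLA


def run_episode(
    instruction: str,
    get_observation: Callable[[], Observation],
    plan_subtask: Callable[[str, Image], str],         # hypothetical VLM planner
    generate_foresight: Callable[[Image, str], Image],  # hypothetical image generator
    vla_act: Callable[[List[Image], str], List[float]], # hypothetical VLA policy
    execute: Callable[[List[float]], None],
    is_done: Callable[[str], bool],                     # assumed completion signal
    max_steps: int = 10,
) -> List[str]:
    """Run planner -> foresight generator -> VLA in a loop; return the subtasks issued."""
    subtasks: List[str] = []
    for _ in range(max_steps):
        obs = get_observation()                      # re-observe every step (closed loop)
        subtask = plan_subtask(instruction, obs.head)
        if is_done(subtask):
            break
        subtasks.append(subtask)
        foresight = generate_foresight(obs.head, subtask)
        # Visual-input augmentation: the imagined future is appended to the camera views.
        action = vla_act(obs.cameras + [foresight], subtask)
        execute(action)
    return subtasks


if __name__ == "__main__":
    blank = [[0, 0], [0, 0]]
    # Made-up subtask sequence for illustration only.
    plan = iter(["pick up the corn", "place the corn in the bowl", "done"])
    executed: List[List[float]] = []
    issued = run_episode(
        instruction="put the corn in the bowl",
        get_observation=lambda: Observation(head=blank, cameras=[blank, blank, blank]),
        plan_subtask=lambda instr, img: next(plan),
        generate_foresight=lambda img, sub: img,      # stub: "future" equals the present
        vla_act=lambda views, sub: [float(len(views))],
        execute=executed.append,
        is_done=lambda sub: sub == "done",
    )
    print(issued, executed)
```

Passing every component as a callable keeps the loop independent of any particular VLM, generator, or VLA, which mirrors the claim that the planner can be attached to existing policies through their visual inputs alone.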

Paper Structure (23 sections, 3 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Overview of our ForeAct framework. The VLM-based subtask planner takes the robot's head-camera observation and generates a subtask instruction for the Foresight Image Generation (ImGen) module. ImGen then predicts the future observation, which is fed into the VLA model together with the subtask instruction and the robot's three camera views. These modules operate jointly to enable closed-loop control.
  • Figure 2: Foresight image generation model. The current observation is first encoded into compact visual tokens and concatenated with a noise input, then fed into the efficient linear DiT model to generate the predicted visual tokens, which are subsequently decoded into the future observation. In addition, the instruction and a specially designed system prompt are incorporated to guide the model's attention toward the robot's actions.
  • Figure 3: (a) Number of subtasks from each dataset. Our data comes from a wide range of sources, and preprocessing yields a total of 1.16 million subtasks. (b) Diverse robot embodiments in the pre-training dataset. The collected dataset covers a wide range of robot embodiments.
  • Figure 4: Examples of tasks in our real-world dataset.
  • Figure 5: Qualitative results of foresight image generation. The first row shows the model without pretraining and the second row shows the model with pretraining. The task is to pick up the corn. We generate four images for each model with different random seeds.
  • ...and 8 more figures
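
The data flow in the Figure 2 caption (encode the current observation into tokens, concatenate with noise, run the generator, decode) can be illustrated at the level of shapes. This is a toy sketch, not the paper's model: `encode`, `decode`, `copy_denoiser`, and `generate_future` are hypothetical, the encoder simply chunks a flat pixel list, the denoiser is a stand-in for the linear DiT, and concatenation along the token axis is an assumption, since the caption does not say which axis is used. Text conditioning on the instruction and system prompt is left out.

```python
import random
from typing import Callable, List

Token = List[float]


def encode(image: List[float], token_dim: int) -> List[Token]:
    """Toy encoder: chunk a flat pixel list into tokens of length token_dim."""
    return [image[i:i + token_dim] for i in range(0, len(image), token_dim)]


def decode(tokens: List[Token]) -> List[float]:
    """Toy decoder: flatten tokens back into a flat pixel list."""
    return [value for token in tokens for value in token]


def copy_denoiser(sequence: List[Token], num_cond: int) -> List[Token]:
    """Stand-in for the DiT: predicts 'no change' by returning the conditioning tokens."""
    return [list(token) for token in sequence[:num_cond]]


def generate_future(
    image: List[float],
    token_dim: int,
    denoiser: Callable[[List[Token], int], List[Token]],
    seed: int = 0,
) -> List[float]:
    rng = random.Random(seed)
    cond = encode(image, token_dim)
    # One noise token per conditioning token.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(token_dim)] for _ in cond]
    sequence = cond + noise  # assumed: concatenation along the token axis
    predicted = denoiser(sequence, len(cond))
    return decode(predicted)


if __name__ == "__main__":
    current = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    future = generate_future(current, token_dim=4, denoiser=copy_denoiser)
    print(future == current)
```

With the copy stand-in the "future" equals the input, which only confirms that the pieces fit together: the generator sees twice as many tokens as the encoder produced and returns one predicted token per conditioning token.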