FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models

Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, Haoang Li

TL;DR

FlowVLA tackles the lack of explicit motion reasoning in Vision-Language-Action models by introducing Visual Chain of Thought (Visual CoT), which decomposes prediction into two steps: motion reasoning via optical flow f_t, followed by appearance generation to produce v_{t+1}. Training proceeds in two stages. FlowVLA first pre-trains a world model with a unified appearance-motion tokenization (v_t and f_t) on an interleaved v_t -> f_t -> v_{t+1} sequence, then fine-tunes for action prediction using discretized action tokens. Across LIBERO, SimplerEnv, and a real AgileX robot, FlowVLA achieves state-of-the-art results and substantially improved sample efficiency, with ablations confirming the critical role of Visual CoT, flow supervision, and interleaved causal sequencing. The results indicate that motion-first world modeling yields more physically plausible forecasts, better language grounding, and stronger transfer from simulation to real-world robotic manipulation.

Abstract

Many Vision-Language-Action (VLA) models are built upon an internal world model trained via next-frame prediction "$v_t \rightarrow v_{t+1}$". However, this paradigm attempts to predict the future frame's appearance directly, without explicitly reasoning about the underlying dynamics. This lack of an explicit motion reasoning step often leads to physically implausible visual forecasts and inefficient policy learning. To address this limitation, we introduce the Visual Chain of Thought (Visual CoT), a paradigm that compels the model to first reason about motion dynamics before generating the future frame. We instantiate this paradigm by proposing FlowVLA, an autoregressive Transformer that explicitly materializes this reasoning process as "$v_t \rightarrow f_t \rightarrow v_{t+1}$", where $f_t$ is an intermediate optical flow prediction that inherently encodes motion. By forcing the model to first follow the motion plan encoded by $f_t$, this process inherently aligns the pre-training objective of dynamics prediction with the downstream task of action generation. We conduct experiments on challenging robotics manipulation benchmarks, as well as real-robot evaluations. Our FlowVLA not only generates more coherent and physically plausible visual predictions, but also achieves state-of-the-art policy performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling in VLAs. Project page: https://irpn-lab.github.io/FlowVLA/
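
To make the "$v_t \rightarrow f_t \rightarrow v_{t+1}$" ordering concrete, the following is a minimal sketch of how an interleaved training sequence could be assembled from already-tokenized frames and flow maps. It is an illustration under stated assumptions, not the authors' implementation: the special-token ids (`BOI`, `EOI`, `BOF`, `EOF`) and the helper names `interleave` and `next_token_pairs` are hypothetical, and the token ids stand in for the output of whatever visual tokenizer is used.

```python
from typing import List, Sequence, Tuple

# Hypothetical special-token ids, chosen only for this illustration.
BOI, EOI = 1, 2  # begin / end of an appearance (frame) block
BOF, EOF = 3, 4  # begin / end of a motion (optical flow) block


def interleave(frames: Sequence[Sequence[int]],
               flows: Sequence[Sequence[int]]) -> List[int]:
    """Build the sequence v_0, f_0, v_1, f_1, ..., v_T from token lists.

    frames[t] holds the tokens of frame v_t; flows[t] holds the tokens of
    the flow f_t that describes the motion from v_t to v_{t+1}.
    """
    if len(frames) != len(flows) + 1:
        raise ValueError("expected exactly one flow between consecutive frames")
    seq: List[int] = []
    for t, frame in enumerate(frames):
        seq += [BOI, *frame, EOI]
        if t < len(flows):
            # The flow block precedes the next frame, so a causal model must
            # emit the motion tokens before the future appearance tokens.
            seq += [BOF, *flows[t], EOF]
    return seq


def next_token_pairs(seq: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (input, target) pairs for standard next-token prediction."""
    return list(zip(seq[:-1], seq[1:]))


# Two frames with two tokens each and one flow map between them.
example = interleave([[10, 11], [12, 13]], [[20, 21]])
print(example)  # [1, 10, 11, 2, 3, 20, 21, 4, 1, 12, 13, 2]
```

Because the sequence is consumed left to right, every token of $v_{t+1}$ is conditioned on the full flow block $f_t$, which is the ordering constraint the abstract describes.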

Paper Structure

This paper contains 21 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Two-Stage Training Paradigm of FlowVLA. (Top) Stage 1: World Model Pre-training with Visual CoT. The model learns to predict an intermediate motion representation (Flow at $T$) from an initial frame (Frame at $T$), and then forecasts the subsequent frame (Frame at $T+1$). This iterative process yields physically plausible, long-horizon video predictions. (Bottom) Stage 2: Policy Fine-tuning. Through fine-tuning, the pre-trained world model is adapted to generate a precise robot action chunk (Action at $T$) from visual observations. This paradigm leverages the learned dynamics for efficient and accurate policy learning.
  • Figure 2: Model Architecture of FlowVLA. Our model instantiates the two-stage training paradigm in Figure 1. (Left) Stage 1: World Model Pre-training with Visual CoT. Input frames are encoded into appearance tokens (pink). The model then autoregressively predicts an interleaved sequence of motion tokens (blue) and future appearance tokens. Our proposed $v_t \rightarrow f_t \rightarrow v_{t+1}$ prediction forces the model to reason about dynamics before forecasting the future. For conceptual clarity, the Image and Flow Tokenizers are visualized separately; in practice, they are the exact same module applied to both appearance and motion inputs. (Right) Stage 2: Policy Fine-tuning. The pre-trained world model is adapted for action prediction. Conditioned on a text instruction (gray) and the current observation (magenta), the model autoregressively predicts action tokens (green) that are decoded into a robot action chunk.
  • Figure 3: AgileX Cobot dual-arm platform and real-world manipulation tasks: (a) system setup: leader arms are user-operated, follower arms mirror the actions. Vision is provided by a front camera for global scene view and wrist cameras for close-up workspace observation. (b) representative single-arm and bimanual operations: from simple single-arm tasks to complex long-horizon bimanual manipulations.
  • Figure 4: Analysis of Physical Plausibility on the Bridge V2 Dataset. This figure highlights common physical failures in the baseline model. In both examples, the baseline model (top row) struggles to maintain physical consistency, leading to implausible outcomes such as a disappearing manipulator or erratic object behavior. In contrast, FlowVLA (bottom row), guided by its motion-first reasoning, produces stable and physically coherent predictions that accurately reflect the scene's dynamics.
  • Figure 5: Analysis of Semantic Alignment on the Bridge V2 Dataset. This figure illustrates the baseline's failure to align predictions with language instructions. While the predicted frames from the baseline model (top row) might appear visually plausible at a glance, the resulting motion does not correspond to the specified task (e.g., moving an object in the wrong direction). FlowVLA (bottom row) again demonstrates superior performance, correctly interpreting the command and generating a corresponding visual trajectory. This underscores that our Visual CoT not only improves physical realism but also enhances the model's ability to ground language in action.
  • ...and 1 more figure
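
The Stage 2 description in Figure 2 mentions action tokens that are decoded into a robot action chunk. The sketch below shows one generic way such a round trip can work, using uniform binning of each action dimension. This is an assumption made for illustration: the text above does not say which discretization scheme FlowVLA uses, and the function names, the bin count of 256, and the action range of [-1, 1] are all hypothetical.

```python
from typing import List, Sequence


def encode_actions(chunk: Sequence[Sequence[float]],
                   low: float, high: float, n_bins: int = 256) -> List[int]:
    """Flatten an action chunk into bin indices via uniform binning."""
    tokens: List[int] = []
    for action in chunk:
        for x in action:
            x = min(max(x, low), high)  # clip to the supported range
            idx = int((x - low) / (high - low) * n_bins)
            tokens.append(min(idx, n_bins - 1))  # x == high maps to last bin
    return tokens


def decode_actions(tokens: Sequence[int], dim: int,
                   low: float, high: float,
                   n_bins: int = 256) -> List[List[float]]:
    """Map bin indices back to bin centres and regroup into actions."""
    width = (high - low) / n_bins
    values = [low + (t + 0.5) * width for t in tokens]
    return [values[i:i + dim] for i in range(0, len(values), dim)]


# One action with three dimensions, assumed normalized to [-1, 1].
tokens = encode_actions([[-1.0, 0.0, 1.0]], low=-1.0, high=1.0)
print(tokens)  # [0, 128, 255]
recovered = decode_actions(tokens, dim=3, low=-1.0, high=1.0)
```

With this scheme the reconstruction error per dimension is at most half a bin width, so the bin count trades vocabulary size against action precision.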