TriVLA: A Triple-System-Based Unified Vision-Language-Action Model with Episodic World Modeling for General Robot Control

Zhenyang Liu, Yongchong Gu, Sixiao Zheng, Yanwei Fu, Xiangyang Xue, Yu-Gang Jiang

TL;DR

TriVLA introduces an episodic world model implemented as a triple-system architecture that combines a vision-language grounding module with a video-diffusion-based dynamics predictor to accumulate, recall, and forecast sequential experiences for robot control. System 2 (episodic multimodal perception, EMP) and System 3 (episodic dynamics perception, EDP) jointly build an episodic representation that conditions System 1's flow-matching policy, supporting long-horizon planning and open-ended instruction following. Across simulated and real-world benchmarks, TriVLA outperforms state-of-the-art baselines, learns data-efficiently, and runs in real time at approximately 36 Hz, underscoring the value of temporally extended, context-aware reasoning in embodied AI. The work points to a scalable path toward robust, generalizable robot intelligence by integrating cognitively inspired episodic memory with multimodal perception and predictive dynamics.

Abstract

Recent advances in vision-language models (VLMs) have enabled robots to follow open-ended instructions and demonstrate impressive commonsense reasoning. However, current vision-language-action (VLA) frameworks primarily rely on static representations and limited temporal context, restricting agents to short-horizon, reactive behaviors and hindering robust generalization in dynamic embodied environments. Inspired by cognitive neuroscience theories of episodic memory, we propose, to our knowledge, one of the first formalized episodic world models in VLA, enabling embodied robots to accumulate, recall, and predict sequential experiences. As an instantiation of this concept, our unified TriVLA realizes the episodic world model through a triple-system architecture: integrating multimodal grounding from a pretrained VLM (System 2) and temporally rich dynamics perception from a video diffusion model (System 3). This enables the agent to accumulate and recall sequential experiences, interpret current contexts, and predict future environmental evolution. Guided by episodic representations that span both the past and anticipated future, the downstream policy (System 1) generates coherent, context-aware action sequences through flow-matching and cross-modal attention mechanisms. Experimental results show that TriVLA operates efficiently at approximately 36 Hz and consistently outperforms baseline models on standard benchmarks and challenging real-world manipulation tasks. It demonstrates strong long-horizon planning and open-ended intent understanding, showcasing the advantages of episodic world model-inspired reasoning for robust, generalizable robot intelligence. Project Page: https://zhenyangliu.github.io/TriVLA/.
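The abstract states that System 1 generates actions through flow-matching. As a rough illustration of what flow-matching sampling involves (not the authors' implementation), the sketch below integrates a velocity field from noise at t = 0 to an action at t = 1 with Euler steps. The names `toy_velocity` and `sample_action`, the 7-dimensional target, and the step count are all assumptions made for this example. In a real policy the velocity would come from a learned network conditioned on the episodic features; here it is replaced by the closed-form velocity toward a single known target.

```python
import random


def toy_velocity(action, t, target):
    """Hypothetical stand-in for a learned velocity network v(a_t, t | context).

    On a straight-line path a_t = (1 - t) * noise + t * target, the exact
    velocity toward one known target is (target - a_t) / (1 - t).
    """
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate da/dt = v(a, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in target]  # a_0 drawn from N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        velocity = toy_velocity(action, t, target)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


# Assumed 7-dimensional action vector, chosen only for illustration.
target = [0.2, -0.1, 0.05, 0.0, 0.0, 0.3, 1.0]
sampled = sample_action(target)
```

With this toy velocity field the final Euler step lands on the target up to floating-point error, which makes the sketch easy to check. A learned field would only approximate the data distribution, and the number of integration steps would trade accuracy against the control rate.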

Paper Structure

This paper contains 26 sections, 6 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: TriVLA is a unified Vision-Language-Action framework that adopts a triple-system architecture inspired by the episodic world model. Image and language inputs are processed by a Vision-Language Model for multimodal perception. A Video Diffusion Model provides dynamic world modeling and future prediction. The policy module integrates sequential outputs, robot state, and action history and generates real-time actions for complex manipulation tasks.
  • Figure 2: Comparison between dual-system architectures and our episodic world model-guided TriVLA. TriVLA implements the episodic world model using a triple-system architecture. In contrast, previous dual-system methods relied on static representations and limited temporal context, which restricted agents to short-horizon, reactive behaviors in dynamic environments.
  • Figure 3: The pipeline of TriVLA. TriVLA is a unified Vision-Language-Action framework built on a triple-system paradigm. System 2 employs a pre-trained Eagle-2 VLM for episodic multimodal perception, while System 3 utilizes a general-purpose VDM to model episodic dynamics and sequential changes. Together, these modules form a joint episodic world model with rich, temporally extended representations. System 1 serves as the policy module, applying action flow-matching to integrate all outputs along with robot state and action history.
  • Figure 4: Qualitative case study of short-horizon tasks. TriVLA performs well on short-horizon tasks. In the real-world tasks, it leverages a triple-system architecture that unifies Episodic Multimodal Perception and Dynamics Perception, both of which are crucial for generalizable policy learning.
  • Figure 5: Qualitative results of long-horizon tasks. TriVLA performs well on long-horizon tasks. In the CALVIN and real-world tasks, it leverages a triple-system architecture that unifies multiple systems for generalizable policy learning.
  • ...and 6 more figures
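
The Figure 3 caption says System 1 integrates the outputs of System 2 and System 3, and the abstract mentions cross-modal attention. A minimal sketch of that idea follows, assuming the two feature streams are simply concatenated into one context that a policy token attends over. The source does not specify how the features are combined, so the concatenation, the single attention head, the 2-D toy features, and every name here (`cross_attention`, `vlm_tokens`, `vdm_tokens`, `policy_query`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Hypothetical 2-D feature tokens; real features would be far larger.
vlm_tokens = [[1.0, 0.0], [0.5, 0.5]]  # stand-in for System 2 (VLM) outputs
vdm_tokens = [[0.0, 1.0]]              # stand-in for System 3 (VDM) outputs
context = vlm_tokens + vdm_tokens      # assumed joint episodic context
policy_query = [1.0, 0.0]              # stand-in for a System 1 token

attended, weights = cross_attention(policy_query, context, context)
```

Here the query is most similar to the first context token, so that token receives the largest weight. In a trained model the queries, keys, and values would pass through learned projections, which this sketch omits.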