TriVLA: A Triple-System-Based Unified Vision-Language-Action Model with Episodic World Modeling for General Robot Control

Zhenyang Liu; Yongchong Gu; Sixiao Zheng; Yanwei Fu; Xiangyang Xue; Yu-Gang Jiang

TL;DR

TriVLA introduces an episodic world model for robot control, implemented as a triple-system architecture. System 2 (EMP), a pretrained vision-language model, provides multimodal grounding, and System 3 (EDP), a video diffusion model, captures temporal dynamics; together they build an episodic representation that lets the agent accumulate, recall, and forecast sequential experiences. This representation conditions System 1, a flow-matching policy with cross-modal attention, which generates coherent action sequences for long-horizon planning and open-ended instruction following. Across simulated and real-world benchmarks, TriVLA outperforms state-of-the-art baselines, demonstrates data-efficient learning, and runs at approximately 36 Hz. The work points toward more robust, generalizable robot intelligence by combining cognitively inspired episodic memory with multimodal perception and predictive dynamics.

Abstract

Recent advances in vision-language models (VLMs) have enabled robots to follow open-ended instructions and demonstrate impressive commonsense reasoning. However, current vision-language-action (VLA) frameworks primarily rely on static representations and limited temporal context, restricting agents to short-horizon, reactive behaviors and hindering robust generalization in dynamic embodied environments. Inspired by cognitive neuroscience theories of episodic memory, we propose, to our knowledge, one of the first formalized episodic world models in VLA, enabling embodied robots to accumulate, recall, and predict sequential experiences. As an instantiation of this concept, our unified TriVLA realizes the episodic world model through a triple-system architecture that integrates multimodal grounding from a pretrained VLM (System 2) with temporally rich dynamics perception from a video diffusion model (System 3). This enables the agent to accumulate and recall sequential experiences, interpret current contexts, and predict future environmental evolution. Guided by episodic representations that span both the past and the anticipated future, the downstream policy (System 1) generates coherent, context-aware action sequences through flow-matching and cross-modal attention mechanisms. Experimental results show that TriVLA operates efficiently at approximately 36 Hz and consistently outperforms baseline models on standard benchmarks and challenging real-world manipulation tasks. It demonstrates strong long-horizon planning and open-ended intent understanding, showcasing the advantages of episodic world model-inspired reasoning for robust, generalizable robot intelligence. Project Page: https://zhenyangliu.github.io/TriVLA/.
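
The abstract names two mechanisms in the System 1 policy: cross-modal attention over episodic features and flow-matching action generation. The toy sketch below illustrates how such pieces can fit together in principle. It is not the authors' implementation: every function name, the token shapes, and the hand-written velocity field (a stand-in for a learned network) are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention of one query over context tokens."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path from noise to `target`, the ideal velocity
    at time t < 1 is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(context_keys, context_values, query, steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) to an action (t=1) with Euler steps."""
    rng = random.Random(seed)
    # Assumption: the conditioning signal is an attention-weighted summary
    # of context tokens (e.g. features from System 2 and System 3).
    target = cross_attention(query, context_keys, context_values)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, target


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]                 # two hypothetical context tokens
    values = [[0.5, -0.5, 0.1], [0.2, 0.3, -0.4]]   # 3-dimensional toy "actions"
    action, target = sample_action(keys, values, query=[1.0, 0.0])
    print("sampled action:", action)
    print("conditioning target:", target)
```

In a trained model the velocity network would itself attend to the context at every integration step; here the attention output is computed once and used as the endpoint of the flow so that the sampling loop stays easy to follow.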
