RELIC: Interactive Video World Model with Long-Horizon Memory

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan

TL;DR

RELIC addresses the triad of real-time long-horizon streaming, coherent spatial memory, and precise user control in interactive video world modeling. It introduces memory-aware latent tokens stored in a KV cache and a long-horizon teacher–student distillation pipeline that yields 20-second, 16 FPS video generation from a single image and text cue, with action-conditioned control and 3D-consistent content retrieval. Through a curated Unreal Engine dataset, memory-compression KV mechanisms, replayed back-propagation distillation, and runtime optimizations, RELIC achieves superior visual quality, tighter action following, and robust long-horizon memory compared with prior work. The approach offers a scalable, foundation-grade platform for next-generation interactive world simulators and embodied AI capable of real-time exploration and memory-consistent content retrieval across diverse scenes and styles.

Abstract

A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
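
The abstract describes the memory only at a high level, so the following is a minimal sketch of the general idea, not RELIC's actual mechanism. It assumes that recent frames keep all of their latent tokens while older frames are average-pooled to fewer tokens, and that each cached frame carries a camera-pose tag. The names `MemoryEntry`, `CompressedMemoryCache`, `recent_window`, and `pool_factor` are hypothetical, as is the choice of average pooling as the compression operator.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class MemoryEntry:
    tokens: List[Vector]        # latent tokens cached for one frame
    pose: Tuple[float, ...]     # camera-pose tag (hypothetical layout)
    compressed: bool = False


def average_pool(tokens: List[Vector], factor: int) -> List[Vector]:
    """Average each run of `factor` consecutive tokens into a single token."""
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


class CompressedMemoryCache:
    """Keeps the newest frames at full token count and pools older ones."""

    def __init__(self, recent_window: int, pool_factor: int) -> None:
        self.recent_window = recent_window
        self.pool_factor = pool_factor
        self.entries: List[MemoryEntry] = []

    def append(self, tokens: List[Vector], pose: Tuple[float, ...]) -> None:
        self.entries.append(MemoryEntry(tokens, pose))
        # Any frame that has left the recent window is compressed exactly once.
        cutoff = max(len(self.entries) - self.recent_window, 0)
        for entry in self.entries[:cutoff]:
            if not entry.compressed:
                entry.tokens = average_pool(entry.tokens, self.pool_factor)
                entry.compressed = True

    def token_count(self) -> int:
        return sum(len(e.tokens) for e in self.entries)


# Usage: 5 frames of 8 two-dimensional tokens each, 2 recent frames kept whole.
cache = CompressedMemoryCache(recent_window=2, pool_factor=4)
for i in range(5):
    cache.append([[float(i), float(j)] for j in range(8)], pose=(float(i), 0.0, 0.0))
```

In this toy setup the cache holds 22 tokens instead of the uncompressed 40, which illustrates why compressing history keeps attention over long contexts affordable. How RELIC actually compresses tokens and encodes actions and poses is specified in the paper, not here.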

Paper Structure

This paper contains 41 sections, 7 equations, 10 figures, 2 tables, and 1 algorithm.

Figures (10)

  • Figure 1: RELIC is an interactive video world model that allows users to freely explore virtual scenes initialized from an arbitrary first-frame image in real time. Built as a 14B-parameter autoregressive model, RELIC generates videos at 480×832 resolution, 16 FPS, for up to 20 seconds, exhibiting consistent long-horizon spatial memory.
  • Figure 2: Dataset statistics visualization. Left: video duration distribution; Right: action distribution.
  • Figure 3: The data curation pipeline in RELIC. Given a set of 3D scenes, we manually collect thousands of camera trajectories and generate high-quality video-action-text triplets through a series of data filtering, captioning, and balancing steps.
  • Figure 4: Model Pipeline. Starting from an input image and a sequence of noisy video latents, our DiT-based architecture generates a 20-second video conditioned on text, action labels, and camera poses. Each DiT block integrates YaRN-RoPE, SDPA with QK-Norm, and cross-attention to conditioning tokens. Camera and action information are embedded through dedicated encoders, and conditioning is injected throughout the denoising process to produce spatially consistent and action-aligned video frames.
  • Figure 5: ODE initialization. We convert a bidirectional video diffusion model into a causal generator by initializing the student on a set of ODE trajectories obtained from the teacher. To achieve this, we adopt a hybrid forcing strategy that combines teacher forcing and diffusion forcing (mask shown on the right).
  • ...and 5 more figures
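
To put the numbers in the Figure 1 caption in perspective, the short calculation below works out the frame count and the average time budget they imply. It uses only the resolution, frame rate, and duration stated in the caption. The per-frame budget is an average throughput figure and assumes nothing about how many frames the model emits per generation step.

```python
# Figures quoted in the Figure 1 caption.
FPS = 16
DURATION_S = 20
HEIGHT, WIDTH = 480, 832

total_frames = FPS * DURATION_S                  # frames in one 20-second rollout
frame_budget_ms = 1000 / FPS                     # average wall-clock time per frame
pixels_per_frame = HEIGHT * WIDTH                # output pixels in a single frame
total_pixels = total_frames * pixels_per_frame   # output pixels over the rollout

print(f"{total_frames} frames, {frame_budget_ms} ms per frame on average")
print(f"{pixels_per_frame:,} pixels per frame, {total_pixels:,} pixels in total")
```

Sustaining real-time playback therefore means producing a 480×832 frame every 62.5 ms on average across all 320 frames of a rollout.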