WorldCompass: Reinforcement Learning for Long-Horizon World Models

Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, Zhou Zhao

TL;DR

WorldCompass introduces a post-training reinforcement learning framework tailored to long-horizon, autoregressive video-based world models. It combines three components: a clip-level rollout strategy, complementary reward functions for interaction following and visual quality, and a negative-aware fine-tuning optimization, which together enable efficient and robust RL training. The approach yields substantial gains in action fidelity and perceptual quality on the state-of-the-art WorldPlay model across short, medium, and long horizons, including complex compositional actions. This work demonstrates that targeted RL post-training can significantly enhance both controllability and visual realism in interactive world models, with practical training efficiency suitable for large-scale deployment.

Abstract

This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ a negative-aware fine-tuning strategy, coupled with various efficiency optimizations, to enhance model capacity efficiently and effectively. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.
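
The clip-level rollout idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: `toy_generate_clip` and `toy_reward` are hypothetical stand-ins for the world model and the two reward functions, the equal reward weights are an assumption, and the group-wise reward normalization is a common RL convention used here for illustration rather than something the abstract specifies. The sketch shows why sharing a prefix is cheaper, assuming every clip costs the same to generate: K candidates at the n-th clip need (n - 1) + K clip generations instead of K * n.

```python
import random
from statistics import mean, pstdev


def clip_generations(n_target, k):
    """Count clip generations needed to obtain k candidates at clip n_target.

    Assumes every clip costs the same to generate (an illustrative assumption).
    """
    full = k * n_target            # k independent rollouts from scratch
    shared = (n_target - 1) + k    # one shared prefix, then k candidates
    return full, shared


def toy_generate_clip(context, action, rng):
    """Hypothetical stand-in for the world model: returns a fake clip record."""
    return {"action": action, "drift": rng.gauss(0.0, 1.0), "quality": rng.random()}


def toy_reward(clip, w_action=0.5, w_quality=0.5):
    """Hypothetical combined reward: action-following score plus visual quality.

    The weights and both scoring rules are placeholders, not the paper's rewards.
    """
    action_score = 1.0 / (1.0 + abs(clip["drift"]))
    return w_action * action_score + w_quality * clip["quality"]


def clip_level_rollout(actions, n_target, k, seed=0):
    """Generate the shared prefix once, then sample k candidates at clip n_target."""
    rng = random.Random(seed)

    # Shared prefix: clips 1 .. n_target - 1 are generated a single time.
    prefix = []
    for action in actions[: n_target - 1]:
        prefix.append(toy_generate_clip(prefix, action, rng))

    # Clip-level rollout: k candidates for the same target clip and action.
    target_action = actions[n_target - 1]
    candidates = [toy_generate_clip(prefix, target_action, rng) for _ in range(k)]

    # Score each candidate and normalize within the group (assumed convention).
    rewards = [toy_reward(c) for c in candidates]
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
    return prefix, candidates, rewards, advantages


if __name__ == "__main__":
    actions = ["W", "W+A", "W", "D", "W", "S", "W+D", "W"]
    print(clip_generations(n_target=8, k=4))
    _, _, rewards, advantages = clip_level_rollout(actions, n_target=8, k=4)
    print(rewards)
    print(advantages)
```

Because all candidates share the same prefix and target action, differences in their rewards can be attributed to the target clip alone, which is one way to read the abstract's claim of fine-grained reward signals.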

Paper Structure

This paper contains 35 sections, 7 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of WorldCompass. 1) Starting from environmental prompts and action sequences, we generate shared prefix video clips. At the $n$-th target clip, we perform clip-level rollouts to generate a set of candidate video clips. 2) We design reliable reward functions to evaluate the action-following accuracy and visual quality of rollout samples. 3) We employ an efficient RL algorithm to optimize the model, steering it toward generating high-scoring video clips.
  • Figure 2: Evolution of interaction-following and visual quality scores during the RL training of WorldPlay (HunyuanVideo-1.5 version). These reward metrics are evaluated on a fixed subset of the test set with complex combined actions.
  • Figure 3: Qualitative comparisons under a complex combined action sequence.
  • Figure 4: Qualitative comparisons under a simple basic action sequence.
  • Figure 5: Visualization Case 1. The input action sequence consists of "W+A" (moving forward-left) for the first half, followed by "$\rightarrow$" (turning right) in the second half.
  • ...and 3 more figures