Enhancing Policy Learning with World-Action Model

Yuci Han, Alper Yilmaz

Abstract

This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM adds an inverse dynamics objective to DreamerV2 that predicts actions from latent state transitions, encouraging the learned representations to capture the action-relevant structure critical for downstream control. We evaluate how WAM enhances policy learning across eight manipulation tasks from the CALVIN benchmark: we first pretrain a diffusion policy via behavioral cloning on world-model latents, then refine it with model-based PPO inside the frozen world model. Without modifying the policy architecture or training procedure, WAM raises average behavioral cloning success from 59.4% to 71.2% over the DreamerV2 and DiWA baselines. After PPO fine-tuning, WAM achieves 92.8% average success versus 79.8% for the baseline, with two tasks reaching 100%, while using 8.7x fewer training steps.
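As a rough sketch of the action-regularization idea described above, an inverse dynamics head can be attached to a DreamerV2-style latent model so that pairs of consecutive latent states must also explain the action connecting them. The code below is illustrative only; the layer sizes, the MSE form of the action loss, and the loss weight `inv_weight` are assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch (assumed names and hyperparameters): an inverse dynamics
# head that predicts the action a_t linking two consecutive latent states, and
# a world-model loss that adds this term to the usual DreamerV2-style objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InverseDynamicsHead(nn.Module):
    def __init__(self, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_t: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        # Predict the action that produced the transition z_t -> z_next.
        return self.net(torch.cat([z_t, z_next], dim=-1))


def world_model_loss(recon_loss, kl_loss, reward_loss,
                     inv_head, z, actions, inv_weight: float = 1.0):
    """Standard reconstruction/KL/reward terms plus the inverse dynamics objective.

    z:       (B, T, latent_dim) latent states inferred from observations
    actions: (B, T-1, action_dim) actions taken between consecutive states
    """
    pred_actions = inv_head(z[:, :-1], z[:, 1:])
    inv_loss = F.mse_loss(pred_actions, actions)
    return recon_loss + kl_loss + reward_loss + inv_weight * inv_loss
```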

Figures (4)

  • Figure 2: (a) Standard world models predict only future observations, treating actions solely as conditioning inputs. (b) Our World-Action Model adds an inverse dynamics head that jointly predicts observations and actions during training. (c) The action-aware world model serves as a learned simulator for offline policy fine-tuning via PPO.
  • Figure 3: WAM architecture. Observations $x_t$ are encoded and produce posterior $z_t$, which is regularized toward the prior $\hat{z}_t$ via KL divergence. The inverse dynamics head predicts actions $\hat{a}_t$ from consecutive encoder embeddings, cascading action-aware structure through the posterior to the prior. The decoder reconstructs observations $\hat{x}_t$ and a reward estimator provides task-completion signals for policy fine-tuning.
  • Figure 4: Qualitative comparison of imagined rollouts on the CALVIN benchmark. We visualize predicted frames at selected timesteps from both static and gripper cameras. Compared to DreamerV2, our WAM produces more realistic future state predictions across the entire rollout horizon.
  • Figure 5: Behavioral cloning evaluation curves across all eight CALVIN tasks. WAM (orange) consistently reaches higher success rates than DiWA (blue) and converges faster on most tasks.
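Figure 2(c) and the abstract describe refining the policy with PPO entirely inside the frozen world model, which acts as a learned simulator. A minimal sketch of such an imagination rollout, assuming a generic `world_model` with `imagine_step` and `predict_reward` interfaces, a 15-step horizon, and a generic stochastic-policy interface (abstracting away the diffusion policy's sampling details), might look like:

```python
# Hypothetical imagination rollout inside a frozen world model, used to collect
# on-policy trajectories for PPO fine-tuning. The world_model and policy
# interfaces below are assumptions for illustration, not the paper's code.
import torch


@torch.no_grad()
def imagine_rollout(world_model, policy, start_latents, horizon: int = 15):
    """Roll the policy forward in latent space; no environment interaction.

    start_latents: (B, latent_dim) latents encoded from real start frames.
    Returns per-step tensors that a PPO update can consume.
    """
    z = start_latents
    latents, actions, log_probs, rewards = [], [], [], []
    for _ in range(horizon):
        dist = policy(z)                         # action distribution given latent state
        a = dist.sample()
        z_next = world_model.imagine_step(z, a)  # prior transition of the frozen model
        r = world_model.predict_reward(z_next)   # learned task-completion signal
        latents.append(z)
        actions.append(a)
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
        z = z_next
    return latents, actions, log_probs, rewards
```

Collecting the old log-probabilities without gradients matches standard PPO practice: the update step recomputes log-probabilities under the current policy, and only those recomputed terms need gradients.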