
World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du

Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
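The abstract describes verification through a forward-inverse cycle: a subgoal generator proposes a plausible future state, a sparse inverse model infers the action from a subset of state features, and the world model rolls forward to check consistency. The toy sketch below illustrates that cycle with illustrative stand-ins; the function names, the linear dynamics, and the boolean verification mask are assumptions for exposition, not the paper's actual models or interfaces.

```python
# Hedged sketch of WAV-style forward-inverse cycle consistency.
# All components here are toy stand-ins: the real subgoal generator is
# trained on video corpora and the real inverse/world models are learned.
import numpy as np

rng = np.random.default_rng(0)

def subgoal_generator(state):
    """Propose a plausible future state (stand-in for a video-trained generator)."""
    return state + rng.normal(0.0, 0.1, size=state.shape)

def sparse_inverse_model(state, subgoal, mask):
    """Infer an action from a sparse subset of action-relevant state features."""
    return (subgoal - state)[mask]

def world_model(state, action, mask):
    """Roll the state forward under the inferred action (toy linear dynamics)."""
    nxt = state.copy()
    nxt[mask] += action
    return nxt

def cycle_consistency_score(state, mask):
    """Generated subgoal -> inferred action -> forward rollout should land
    back near the subgoal on the verification subset; a large residual
    flags a world-model prediction error."""
    subgoal = subgoal_generator(state)
    action = sparse_inverse_model(state, subgoal, mask)
    predicted = world_model(state, action, mask)
    return float(np.linalg.norm(predicted[mask] - subgoal[mask]))

state = rng.normal(size=8)
mask = np.array([True, True, False, False, True, False, False, False])
score = cycle_consistency_score(state, mask)
print(score)
```

In this toy setting the dynamics are exact, so the cycle closes with zero residual; with a learned world model, the residual on the verification subset serves as the self-verification signal that prioritizes transitions for further training.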


Paper Structure

This paper contains 54 sections, 5 theorems, 38 equations, 14 figures, 4 tables, 1 algorithm.

Key Result

Proposition 3.1

Assume there exists an identifiable verification subset $\mathcal{S}$ such that: (i) $\mathbf{z}_{\mathcal{S}}^{t+1}$ depends only on $(\mathbf{z}_{\mathcal{S}}^{t},\mathbf{a}^{t})$ and not on the rest of the scene; (ii) $(\mathbf{z}_{\mathcal{S}}^{t},\mathbf{a}^{t})$ stays on-support even when $(\m

Figures (14)

  • Figure 1: Overview of World Action Verifier, a framework that enables action-conditioned world models to self-improve from an asymmetric forward-inverse cycle: (i) a diverse subgoal generator proposes plausible future states, (ii) a sparse inverse model infers actions from a relevant subset of state features, and (iii) a world model rolls forward and verifies consistency between its predicted state and the proposed state.
  • Figure 2: Decomposing world model verification into state plausibility and action reachability.
  • Figure 3: Verification of the robustness of WAV on MiniGrid. (Left) Sample efficiency comparison between Sparse IDM and the World Model with six objects. (Mid) Robustness to increasing state complexity. (Right) Robustness to growing environment stochasticity.
  • Figure 4: Evaluation of world model learning with WAV on MiniGrid. (Left) Action prediction accuracy of Sparse IDM and Vanilla IDM. Sparse IDM achieves better out-of-distribution generalization under limited data. (Mid) Correlation with Oracle ranking. We measure how well each method ranks informative samples using Spearman and Kendall correlations between method-assigned scores and Oracle scores. (Right) Comparison of acquisition strategies. Our proposed WAV outperforms standard baselines and approaches Oracle performance by prioritizing interaction-rich transitions.
  • Figure 5: Verification of the robustness of WAV on RoboMimic and ManiSkill. Correlation with Oracle ranking. We evaluate how well each method orders informative samples by computing Spearman rank correlations between each method's assigned scores and Oracle scores on RoboMimic and ManiSkill environments. Higher correlation indicates closer agreement with the Oracle's ranking.
  • ...and 9 more figures

Theorems & Definitions (11)

  • Proposition 3.1: Informal
  • Proposition 3.2: Informal
  • Definition C.1: On-support vs. out-of-support (OOS)
  • Definition C.2: Compositional OOS transition
  • Definition C.3: Source (insulated) set
  • Definition C.4: Verification subset
  • Theorem C.8: Identifiability of Self-Improvement
  • Proof of Theorem C.8
  • Lemma C.9: OLS excess risk under isotropic Gaussian covariates
  • Proposition C.10: Exact forward--inverse gap in the linear--Gaussian model
  • ...and 1 more