RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

Hao Wu, Yuqi Li, Yuan Gao, Fan Xu, Fan Zhang, Kun Wang, Penghao Zhao, Qiufeng Wang, Yizhou Zhao, Weiyan Wang, Yingli Tian, Xian Wu, Xiaomeng Huang

Abstract

Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.

Paper Structure

This paper contains 115 sections, 1 theorem, 42 equations, 25 figures, 33 tables, and 2 algorithms.

Key Result

Proposition 1

Under the stylized model above, sliding-window re-encoding with window size $W$ yields an error bound that does not grow explicitly with $T$; the $\alpha\!\to\!0$ limit simplifies to $2W\varepsilon + \delta_q$. By contrast, vanilla AR without refresh admits a worst-case bound of order $\varepsilon/(1-\alpha)$, which blows up as $\alpha\!\to\!1$ and degrades to $\mathcal{O}(T\varepsilon)$ in the non-contractive limit.
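
To make the contrast concrete, the stylized recursion can be simulated directly. The sketch below assumes a per-step drift model $e_{t+1} = \alpha e_t + \varepsilon$ in which a refresh resets the accumulated error to the quantization floor $\delta_q$ every $W$ steps; the constants and the exact reset rule are illustrative assumptions, not the paper's model.

```python
# Toy simulation of long-horizon error accumulation under a stylized model:
# vanilla AR obeys e_{t+1} = alpha * e_t + eps, while SWR additionally resets
# the accumulated error to the quantization floor delta_q every W steps.
# All constants and the reset rule are illustrative, not the paper's exact model.

def simulate(T, alpha, eps, delta_q=0.0, W=None):
    e, trace = 0.0, []
    for t in range(1, T + 1):
        e = alpha * e + eps          # one autoregressive step of drift
        if W is not None and t % W == 0:
            e = delta_q              # refresh: re-encoding discards accumulated drift
        trace.append(e)
    return trace

T, alpha, eps, delta_q, W = 200, 0.99, 0.01, 0.02, 16
vanilla = simulate(T, alpha, eps)
swr = simulate(T, alpha, eps, delta_q, W)
print(f"vanilla AR error at T={T}: {vanilla[-1]:.3f}")   # approaches eps/(1-alpha)
print(f"SWR error at T={T}:        {swr[-1]:.3f}")       # stays bounded in T
```

With these numbers the vanilla rollout drifts toward its $\varepsilon/(1-\alpha)$ ceiling, while the refreshed rollout stays roughly an order of magnitude lower regardless of horizon.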

Figures (25)

  • Figure 1: Overview of RoboAlign-R1. RobotWorldBench provides robot-centric benchmark statistics and fine-grained annotations for training a multimodal teacher judge. The teacher is distilled into a lightweight student reward model for efficient reinforcement-learning-based post-training of robot video world models. In parallel, a sliding-window re-encoding strategy stabilizes long-horizon autoregressive rollouts by periodically refreshing the visual context during inference.
  • Figure 2: Token-based robot video world model. (a) Training: a dual-branch FSQ tokenizer produces context tokens $c$ and dynamics tokens $d_t$; discretized action tokens are interleaved and modeled by a 12-layer LLaMA Transformer with loss on dynamics tokens only (see the loss-masking sketch after this list). (b) Inference: context tokens are encoded once and cached; the model autoregressively predicts $\hat{d}_{t+1}$ and triggers sliding-window re-encoding every $W$ steps.
  • Figure 3: Sliding window re-encoding. Top: analogy to StreamingLLM [xiao2023efficient], which retains an attention sink and a sliding KV-cache window in the language domain. Bottom: our approach periodically decodes the last predicted frame to pixel space, re-encodes it as fresh context tokens, and resets the autoregressive prompt, empirically limiting long-horizon drift while keeping the active KV-cache bounded by $O(W)$ (see the rollout sketch after this list). Right: technical details of a single refresh step.
  • Figure 4: Qualitative comparison on a representative manipulation case. RoboAlign-R1 generates physically coherent sequences with accurate grasping.
  • Figure 5: Qualitative case study of RoboAlign-R1 on RT-1 and BridgeData V2, showing improved texture, shadow consistency, and background stability.
  • ...and 20 more figures
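
Figure 2(a) trains on an interleaved token sequence but supervises only the dynamics tokens. The following is a minimal PyTorch sketch of that loss masking, assuming a hypothetical sequence layout [context | action | dynamics | ...]; the tensor names and mask construction are ours, not the authors' code.

```python
import torch.nn.functional as F

def dynamics_only_loss(logits, targets, is_dynamics):
    """Cross-entropy over the interleaved sequence, masked to dynamics tokens.

    logits:      (B, L, V) next-token predictions from the Transformer
    targets:     (B, L)    ground-truth token ids
    is_dynamics: (B, L)    bool mask, True where the target is a dynamics token
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    mask = is_dynamics.float()
    # Context and action tokens are conditioning only: they contribute no loss.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

Masking rather than truncating the sequence keeps the action and context tokens visible to attention while excluding them from the objective.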
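Figure 3's refresh step reduces to a short inference loop: decode the latest predicted frame to pixels, re-encode it as fresh context tokens, and reset the autoregressive prompt. A minimal sketch follows, assuming hypothetical `tokenizer` and `model` interfaces rather than the authors' actual API.

```python
# Minimal sketch of sliding-window re-encoding (SWR) at inference time.
# The `tokenizer` and `model` interfaces are hypothetical stand-ins,
# not the authors' implementation.

def rollout_with_swr(model, tokenizer, first_frame, actions, window_size):
    """Autoregressive rollout that refreshes the context every `window_size` steps."""
    context = tokenizer.encode_context(first_frame)   # context tokens c
    dynamics = []                                     # predicted dynamics tokens
    frames = []
    for step, action in enumerate(actions):
        # Predict next dynamics tokens from (context, recent dynamics, action).
        d_next = model.predict_next(context, dynamics, action)
        dynamics.append(d_next)
        frame = tokenizer.decode_frame(context, d_next)
        frames.append(frame)
        # Refresh: decode the last frame to pixel space, re-encode it as
        # fresh context tokens, and reset the autoregressive prompt.
        if (step + 1) % window_size == 0:
            context = tokenizer.encode_context(frame)
            dynamics = []
    return frames
```

Because the prompt is rebuilt every $W$ steps, the active KV-cache never exceeds $O(W)$ tokens, matching the bound stated in the Figure 3 caption.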

Theorems & Definitions (2)

  • Proposition 1
  • Remark 1: Role of $\alpha$ and choice of $W$