Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao

Abstract

Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
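The streaming scheme described in the abstract can be pictured with a toy rollout loop. This is only a sketch under stated assumptions, not the authors' implementation: `streaming_rollout`, `toy_generate`, and the `window` parameter are hypothetical names, and strings stand in for the cached key/value tensors. It shows the two ideas the abstract names: a bounded rolling context, and updates that touch only the current clip while earlier clips are treated as fixed conditioning.

```python
from collections import deque


def streaming_rollout(num_clips, window, generate_clip):
    """Generate clips one at a time, conditioning on at most `window` prior clips.

    Hypothetical helper: a real system would cache key/value tensors and call
    the video model; here `generate_clip` is any callable (index, context) -> clip.
    """
    cache = deque(maxlen=window)  # rolling context: the oldest entry is evicted when full
    trace = []
    for n in range(num_clips):
        # Snapshot of the prior context. In training this would be detached,
        # so no gradient flows into earlier clips.
        context = tuple(cache)
        clip = generate_clip(n, context)
        # An RL update would be applied to clip n only (the local window).
        trace.append((n, context))
        cache.append(clip)
    return trace


def toy_generate(n, context):
    """Stand-in for the model: returns a label instead of video frames."""
    return f"clip{n}"


trace = streaming_rollout(num_clips=5, window=2, generate_clip=toy_generate)
for step, context in trace:
    print(step, context)
```

With `window=2`, the last clip is conditioned only on the two clips before it, so memory stays constant as the video grows.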

Paper Structure (24 sections, 2 theorems, 16 equations, 37 figures, 5 tables, 2 algorithms)

Key Result

theorem 1

Let $\mathcal{C}_n$ be the frozen context at step $n$, $\tilde{r}(x_n, \mathcal{C}_n) \in [0,1]$ the normalized advantage, and $\alpha \triangleq \alpha(x_n^t, \mathcal{C}_n) = \mathbb{E}_{\pi_{\mathrm{old}}}[\tilde{r} \mid x_n^t, \mathcal{C}_n]$ the posterior positive probability. Define the implic and the local policy loss ($\beta > 0$ controls negative repulsion strength): where $v^{\pm}$ are
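The theorem statement requires a normalized advantage $\tilde{r} \in [0,1]$ that can be read as the probability that a sample is positive. The excerpt does not say how $\tilde{r}$ is obtained from raw rewards, so the following is a minimal sketch under assumptions: rewards are standardized within a sampling group, clipped, and mapped affinely into $[0,1]$. The function names and the `clip` parameter are hypothetical, and the split into a positive weight $\tilde{r}$ and a negative weight $1-\tilde{r}$ is one plausible reading of "posterior positive probability", not a confirmed detail of the paper.

```python
import statistics


def normalized_advantage(rewards, clip=1.0):
    """Map a group of raw rewards to values in [0, 1].

    Assumed recipe (not taken from the paper): z-score within the group,
    clip to [-clip, clip], then shift and scale into [0, 1].
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All samples tie: no preference signal, so every sample is neutral.
        return [0.5] * len(rewards)
    out = []
    for r in rewards:
        z = (r - mean) / std
        z = max(-clip, min(clip, z))
        out.append(0.5 + 0.5 * z / clip)
    return out


def branch_weights(r_tilde):
    """Soft-label weights for the positive and negative branches (assumed form)."""
    return r_tilde, 1.0 - r_tilde


group = [0.2, 0.9, 0.5, 0.4]
for raw, r_tilde in zip(group, normalized_advantage(group)):
    w_pos, w_neg = branch_weights(r_tilde)
    print(f"reward={raw:.2f}  r_tilde={r_tilde:.3f}  w_pos={w_pos:.3f}  w_neg={w_neg:.3f}")
```

Samples above the group mean receive more positive weight and samples below it receive more negative weight, which is the contrast the abstract describes between positive and negative samples.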

Figures (37)

  • Figure 1: Astrolabe efficiently aligns distilled streaming video models with human preferences without re-distillation, enhancing baselines (e.g., Causal Forcing [zhu2026causal], LongLive [yang2025longlive], and Infinite-RoPE [yesiltepe2025infinityrope]) by mitigating artifacts and improving temporal consistency. We demonstrate boosted perceptual quality across: (Top) single-prompt short, (Middle) single-prompt long, and (Bottom) multi-prompt long video generation.
  • Figure 2: Overview of Astrolabe. We propose a memory-efficient RL framework for distilled streaming video models. The method combines streaming rollout with a rolling KV cache for efficient group-wise sampling (left) and clip-level forward-process RL for solver-agnostic optimization (middle). To scale to long videos, we use Streaming Long Tuning with detached historical gradients. A multi-reward formulation paired with uncertainty-based selective regularization mitigates reward hacking during training (right). Pseudocode for the algorithm is provided in the supplementary materials.
  • Figure 3: Qualitative comparison under the short-video, single-prompt setting. We evaluate our framework (+Ours) against the baselines. Visual results show that our method generates videos with significantly sharper textures and superior motion coherence, aligning better with human preferences. More results can be found in the supplementary material.
  • Figure 4: Qualitative results under the single-prompt long-video setting. Our framework (+Ours) effectively transfers alignment optimizations from short videos to extended temporal horizons, delivering enhanced visual detail and greater stability throughout the sequence.
  • Figure 5: Performance improvements across different models. We evaluate our method on three models. The dashed grey lines indicate the baseline performance of the respective base models. The results demonstrate that our approach consistently improves both HPSv3 and MQ scores across all three models.
  • ...and 32 more figures

Theorems & Definitions (4)

  • theorem 1: Conditional Improvement via Advantage Guidance
  • proof
  • theorem 2: Performance Lower Bound with Selective Trust Region
  • proof