InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO

Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, Guo-jun Qi

TL;DR

InfLVG tackles the difficulty of extending autoregressive text-to-video models to long, multi-scene videos by introducing an inference-time context-selection policy trained with Group Relative Policy Optimization. The policy ranks and preserves only the top-K semantically relevant context tokens, keeping computation fixed while maintaining cross-scene consistency and alignment with evolving prompts via a hybrid reward that combines content identity, text-video alignment, and artifact suppression. A new benchmark, CsVBench, assesses cross-scene coherence, with experiments showing up to 9× video length extension and robust performance across scenes and subjects. The work enables scalable long-video generation without long-form training data and provides a practical framework with reusable components for future improvements in long-form video synthesis.

Abstract

Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model's ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks and selects the top-$K$ most contextually relevant tokens, allowing the model to maintain a fixed computational budget while preserving content consistency and prompt alignment. To optimize the policy, we design a hybrid reward function that jointly captures semantic alignment, cross-scene consistency, and artifact reduction. To benchmark performance, we introduce the Cross-scene Video Benchmark (CsVBench) along with an Event Prompt Set (EPS) that simulates complex multi-scene transitions involving shared subjects and varied actions/backgrounds. Experimental results show that InfLVG can extend video length by up to 9$\times$, achieving strong consistency and semantic fidelity across scenes. Our code is available at https://github.com/MAPLE-AIGC/InfLVG.
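The fixed-budget selection step described in the abstract can be illustrated with a minimal sketch. It assumes each past context token already has a scalar relevance score (in the paper these come from a learned policy conditioned on the current prompt; here they are hard-coded). The function name `select_top_k_context` and the deterministic ranking are illustrative assumptions, not the authors' implementation, which may sample its selections during training.

```python
from typing import List, Sequence


def select_top_k_context(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring context tokens.

    Hypothetical helper: the scores stand in for the output of a learned
    context selection policy. Indices are returned in their original
    (temporal) order so the retained context stays chronologically sorted.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    k = min(k, len(scores))
    # Rank by score (descending); ties go to the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top k, then restore temporal order.
    return sorted(ranked[:k])


# Six past tokens, budget of three: the context size stays fixed
# no matter how long the generation history grows.
relevance = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]
kept = select_top_k_context(relevance, k=3)
print(kept)  # [1, 3, 5]
```

Because `k` is constant, the cost of attending to the retained context does not grow with the number of scenes generated so far, which is the property the abstract refers to as a fixed computational budget.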

Paper Structure

This paper contains 18 sections, 11 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Challenges in autoregressive long video generation across scenes. (Left) Baseline models tend to repeat initial scene elements (e.g., background, cup) due to unfiltered context accumulation, failing to follow new prompts. (Right) InfLVG addresses this by selectively preserving relevant context, achieving both better prompt alignment and content consistency: it generates both face and environmental elements in accordance with the prompt and given context.
  • Figure 2: Different scene extension paradigms with InfLVG. (a) Single-scene extension, (b) Multi-scene transition with contextual awareness, and (c) Both single- and multi-scenes.
  • Figure 3: GRPO training pipeline. The DiT-based autoregressive video model generates a group of next scenes under top-$K$ sampling actions. These videos are scored by the hybrid rewards and InfLVG utilizes GRPO to update the context selection model.
  • Figure 4: Illustration of Context Selection Model $\mathcal{F}_{\theta}$. Past video tokens and the current prompt are fused via cross-attention, then top-$K$ ranking is applied to sample context from the KV cache.
  • Figure 5: Comparison of different context selection designs under cross-scene video generation.
  • ...and 6 more figures
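
The group-based scoring in the Figure 3 caption can be sketched as follows. GRPO normalizes each sample's reward against the other samples in its group instead of using a learned value baseline. The weighted-sum form of `hybrid_reward`, its default weights, and the choice of population standard deviation are assumptions made for illustration; this section does not give the paper's exact reward formula.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def hybrid_reward(
    identity: float,
    alignment: float,
    artifact_penalty: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Combine the three reward signals into one scalar.

    Assumption: a simple weighted sum in which artifacts are subtracted.
    The weights are hypothetical placeholders.
    """
    w_id, w_align, w_art = weights
    return w_id * identity + w_align * alignment - w_art * artifact_penalty


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Standardize rewards within one group of sampled selections."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four candidate context selections for the same next scene,
# each scored as (identity, alignment, artifact_penalty).
group_scores = [
    (0.9, 0.8, 0.1),
    (0.6, 0.7, 0.3),
    (0.8, 0.4, 0.0),
    (0.3, 0.5, 0.5),
]
rewards = [hybrid_reward(*s) for s in group_scores]
advantages = group_relative_advantages(rewards)

# Selections that beat the group average receive a positive advantage.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Selections with a positive advantage would have their probability increased by the policy update, and those with a negative advantage decreased; when every sample in a group scores the same, all advantages are zero and that group contributes no gradient.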