Video-T1: Test-Time Scaling for Video Generation

Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan

TL;DR

This work addresses the challenge of improving text-conditioned video generation without retraining by introducing Test-Time Scaling (TTS) for videos. It reframes video generation as a search over trajectories in Gaussian noise space, using a Video-T1 framework that employs test-time verifiers and heuristic search, notably Random Linear Search and a new Tree-of-Frames (ToF) method. ToF achieves efficient, autoregressive exploration with image-level alignment, hierarchical prompting, and heuristic pruning, reducing computational cost while maintaining high-quality, text-aligned outputs. Across multiple diffusion-based and autoregressive models, TTS yields consistent quality gains, with larger models benefiting more from expanded inference budgets, highlighting the practical potential of inference-time optimization for video synthesis.

Abstract

By scaling up training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have extended scaling to test time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem: sampling better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers that provide feedback and heuristic algorithms that guide the search process. Given a text prompt, we first explore an intuitive linear search strategy that increases the number of noise candidates at inference time. Because full-step denoising of all frames simultaneously incurs heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF), which adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality. Project page: https://liuff19.github.io/Video-T1
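The linear search strategy mentioned in the abstract can be read as a best-of-N selection over initial noise samples. The following is a minimal sketch under that reading, not the authors' implementation: `generate`, `verify`, `toy_generate`, and `toy_verify` are hypothetical stand-ins for a video generator and a test-time verifier, and a "video" is reduced to a list of numbers so the example runs without any model.

```python
import random
from typing import Callable, List, Tuple

Video = List[float]  # toy stand-in: one number per frame (assumption)


def random_linear_search(
    generate: Callable[[str, int], Video],
    verify: Callable[[str, Video], float],
    prompt: str,
    num_candidates: int,
    seed: int = 0,
) -> Tuple[Video, float]:
    """Sample N noise seeds, fully generate each video, keep the best-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    best_video: Video = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        noise_seed = rng.randrange(2**31)    # one Gaussian-noise candidate
        video = generate(prompt, noise_seed)  # stands in for full-step denoising
        score = verify(prompt, video)         # verifier feedback, higher is better
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score


def toy_generate(prompt: str, noise_seed: int) -> Video:
    # Hypothetical generator: the noise seed fully determines the output.
    rng = random.Random(noise_seed)
    return [rng.gauss(0.0, 1.0) for _ in range(8)]


def toy_verify(prompt: str, video: Video) -> float:
    # Hypothetical verifier: rewards small frame-to-frame changes.
    return -sum(abs(b - a) for a, b in zip(video, video[1:]))


video_1, score_1 = random_linear_search(toy_generate, toy_verify, "a cat surfing", 1)
video_8, score_8 = random_linear_search(toy_generate, toy_verify, "a cat surfing", 8)
print(f"best score with N=1: {score_1:.3f}, with N=8: {score_8:.3f}")
```

Because both runs use the same search seed, the N=8 run evaluates the N=1 candidate plus seven more, so its best score can never be lower. The cost grows linearly with N, since every candidate is generated in full before it is scored.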

Paper Structure

This paper contains 20 sections, 10 equations, 18 figures, 2 tables, 2 algorithms.

Figures (18)

  • Figure 1: Video-T1: We present the generative effects and performance improvements of video generation under Test-Time Scaling (TTS) settings. The videos generated with TTS are of higher quality and more consistent with the prompt than those generated without TTS.
  • Figure 2: Results of Test-Time Scaling for Video Generation. As the number of samples in the search space increases by scaling test-time computation (TTS), the models' performance improves consistently. In the bar chart, light colors correspond to results without TTS, while dark colors represent the improvement after TTS.
  • Figure 3: Pipeline of Test-Time Scaling for Video Generation. Top: Random Linear Search for TTS video generation randomly samples Gaussian noises, prompts the video generator to produce a sequence of video clips through step-by-step denoising in a linear manner, and selects the clip with the highest score from the test verifiers. Bottom: Tree-of-Frames (ToF) Search for TTS video generation divides the video generation process into three stages: (a) the first stage performs image-level alignment, which influences the later frames; (b) the second stage applies dynamic prompts in the test verifiers $\mathcal{V}$ that focus on motion stability and physical plausibility, providing feedback that guides the heuristic search process; (c) the last stage assesses the overall quality of the video and selects the video with the highest alignment with the text prompt.
  • Figure 4: Performance of random linear search on different video models and verifiers. The top row displays results for autoregressive models, while the bottom row shows diffusion-based models. The initial points of the curves represent the random video sample results without TTS. The models are arranged in order of increasing parameter count from left to right; different colored curves represent the performance trends under various verifiers, and the gray dashed line corresponds to the baseline established by VBench, which serves as a ground-truth verifier.
  • Figure 5: Comparison between random linear search and ToF search. The red curve represents random linear search. The blue curve represents ToF search, with the dashed line being the predicted curve from a geometric series decay approximation. Curve fitting reveals that similar subsequent trends tend to converge to an upper limit.
  • ...and 13 more figures
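The three-stage Tree-of-Frames search described in the Figure 3 caption can be sketched as a pruned tree search over frames. This is only an illustration under simplifying assumptions: the paper describes adaptive expansion and pruning, whereas this sketch uses a fixed `branch_factor` and `beam_width`, and `toy_extend`, `toy_partial`, and `toy_final` are hypothetical stand-ins for the autoregressive generator and the stage-specific verifiers.

```python
import random
from typing import Callable, List, Tuple

Video = List[float]  # toy stand-in: one number per frame (assumption)


def tree_of_frames_search(
    extend: Callable[[str, Video, random.Random], float],
    score_partial: Callable[[str, Video, int], float],
    score_final: Callable[[str, Video], float],
    prompt: str,
    num_frames: int,
    branch_factor: int,
    beam_width: int,
    seed: int = 0,
) -> Tuple[Video, int]:
    """Grow videos frame by frame, pruning low-scoring branches at every step."""
    rng = random.Random(seed)
    branches: List[Video] = [[]]
    generations = 0  # number of frames actually generated
    for t in range(num_frames):
        candidates: List[Video] = []
        for branch in branches:
            for _ in range(branch_factor):
                candidates.append(branch + [extend(prompt, branch, rng)])
                generations += 1
        # Stage (a) at t == 0, stage (b) afterwards: score partial videos, prune.
        candidates.sort(key=lambda v: score_partial(prompt, v, t), reverse=True)
        branches = candidates[:beam_width]
    # Stage (c): pick the surviving video with the best overall score.
    return max(branches, key=lambda v: score_final(prompt, v)), generations


def toy_extend(prompt: str, branch: Video, rng: random.Random) -> float:
    # Hypothetical autoregressive step: next frame is a perturbation of the last.
    previous = branch[-1] if branch else 0.0
    return previous + rng.uniform(-1.0, 1.0)


def toy_partial(prompt: str, branch: Video, t: int) -> float:
    if t == 0:
        return -abs(branch[0])           # stand-in for image-level alignment
    return -abs(branch[-1] - branch[-2])  # stand-in for motion stability


def toy_final(prompt: str, branch: Video) -> float:
    # Stand-in for the overall quality assessment of a complete video.
    return -abs(branch[0]) - sum(abs(b - a) for a, b in zip(branch, branch[1:]))


best_video, num_generations = tree_of_frames_search(
    toy_extend, toy_partial, toy_final, "a cat surfing",
    num_frames=6, branch_factor=3, beam_width=2,
)
print(len(best_video), num_generations)
```

With these settings the search generates 3 frames at the first step and 2 × 3 = 6 at each of the five later steps, 33 in total, whereas an unpruned tree with the same branching would end in 3^6 = 729 complete videos. Pruning partial videos early is what keeps the cost below that of scoring fully generated candidates.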