ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos

Haolin Yang, Feilong Tang, Ming Hu, Qingyu Yin, Yulong Li, Yexin Liu, Zelin Peng, Peng Gao, Junjun He, Zongyuan Ge, Imran Razzak

TL;DR

This paper tackles the challenge of generating coherent long videos with Video Diffusion Models by introducing ScalingNoise, an inference-time beam-search strategy that searches for golden initial noises. It couples one-step denoising evaluation with a long-term, anchor-based reward to guide noise selection, and uses a tilted distribution to preserve diversity. The approach achieves superior long-range content consistency and frame quality across benchmarks, while dramatically reducing per-step search cost. This method enables more scalable, high-quality long video generation under realistic compute budgets.

Abstract

Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding an inference-time search of VDMs toward better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves high-level object features by referencing the anchor frame from previous chunks, thereby delivering long-term value. Our analysis reveals that diffusion models inherently allow computation to be adjusted flexibly by varying the number of denoising steps, and that even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.
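The search loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation. The functions `one_step_denoise` and `anchor_reward` are hypothetical stand-ins for a real VDM's one-step clean-clip estimate and for the anchored reward model. The tilting rule (weights proportional to `exp(tilt * score)`) and the `tilt` parameter are assumptions chosen for illustration. Clips are represented as plain vectors.

```python
import numpy as np

def one_step_denoise(noise):
    """Toy stand-in (assumption) for a VDM's one-step clean-clip estimate."""
    return np.tanh(noise)

def anchor_reward(clip, anchor):
    """Toy reward (assumption): cosine similarity between clip and anchor."""
    denom = np.linalg.norm(clip) * np.linalg.norm(anchor) + 1e-8
    return float(np.dot(clip, anchor) / denom)

def search_noises(anchor, num_chunks, beam_size=2, num_candidates=5,
                  tilt=5.0, seed=0):
    """Beam search over initial noises, one chunk per step."""
    rng = np.random.default_rng(seed)
    dim = anchor.shape[0]
    beams = [([], 0.0)]  # each beam: (noises chosen so far, cumulative reward)
    for _ in range(num_chunks):
        expanded = []
        for noises, score in beams:
            for _ in range(num_candidates):
                z = rng.standard_normal(dim)
                # Score the candidate after a single denoising step.
                r = anchor_reward(one_step_denoise(z), anchor)
                expanded.append((noises + [z], score + r))
        scores = np.array([s for _, s in expanded])
        # Tilted selection: up-weight promising candidates instead of
        # taking a hard top-k, so lower-scoring noises can still survive.
        weights = np.exp(tilt * (scores - scores.max()))
        probs = weights / weights.sum()
        k = min(beam_size, len(expanded))
        keep = rng.choice(len(expanded), size=k, replace=False, p=probs)
        beams = [expanded[i] for i in keep]
    # Return the surviving beam with the highest cumulative reward.
    return max(beams, key=lambda b: b[1])

if __name__ == "__main__":
    best_noises, best_score = search_noises(np.ones(8), num_chunks=3)
    print(len(best_noises), round(best_score, 3))
```

In this sketch a larger `tilt` makes selection approach a greedy top-k, while `tilt=0` reduces it to uniform sampling among candidates. The defaults `beam_size=2` and `num_candidates=5` mirror the settings reported in the Figure 2 caption.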

Paper Structure

This paper contains 17 sections, 8 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: An overview of how ScalingNoise improves long video generation through inference-time search. (a) Chunk-by-chunk and FIFO-Diffusion methods often suffer from accumulated errors and visual degradation over long sequences. (b) ScalingNoise mitigates this by conducting a tailored step-by-step beam search for suitable initial noises, guided by a reward model that incorporates an anchor frame to ensure a long-term signal. (c) At each step, we perform one-step denoising on candidate noises to obtain a clearer clip for evaluation; the reward model then predicts the long-term value of each candidate, helping avoid noises that could introduce future inconsistencies.
  • Figure 2: Comparisons of $\text{FVD}_{128}$ and IS scores on UCF-101. ScalingNoise uses Latte [ma2024latte] as its baseline, with a beam size of 2 and 5 noise candidates. The FVD and IS scores of the other methods are taken from their respective papers, and PVDM [yu2023pvdm] denotes PVDM-L (400-400s).
  • Figure 3: User Study. Win rate of videos generated using ScalingNoise compared with other inference-time scaling methods.
  • Figure 4: The upper part of the figure shows a greedy approach to generating long videos. In contrast, the lower part outlines the tree-structured search process of ScalingNoise. The prompt is “Red wine is poured into a glass. highly detailed, cinematic, arc shot, high contrast, soft lighting”.
  • Figure 5: (a) Boxplots showing the effect of scaling the beam size for ScalingNoise under two paradigms, FIFO-Diffusion and chunk-by-chunk, in that order. (b) From left to right: correlation of reward model DINO and CLIP feature similarity scores and final subject consistency. All points are generated by VideoCrafter2.
  • ...and 4 more figures