VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal

TL;DR

VideoRepair tackles misalignment in text-to-video generation by introducing a two-stage, training-free refinement framework that does not require additional generators. It first performs video refinement planning via LLM- and MLLM-based evaluation questions to identify misaligned content, then conducts localized refinement using a Region-Preserving Segmentation (RPS) module to preserve correctly generated regions while regenerating misaligned areas with region-specific prompts and selective noise reinitialization. The approach yields substantial gains in text-video alignment on EvalCrafter and T2V-CompBench while maintaining visual quality, and supports iterative refinements for progressive improvement. Overall, VideoRepair provides a practical, model-agnostic pathway to tighten prompt guidance in diffusion-based T2V systems, with implications for more reliable, compositional video generation in real-world applications.

Abstract

Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that have misalignments with text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of two stages: In (1) video refinement planning, we first detect misalignments by generating fine-grained evaluation questions and answering them using an MLLM. Based on video evaluation outputs, we identify accurately generated objects and construct localized prompts to precisely refine misaligned regions. In (2) localized refinement, we enhance video alignment by 'repairing' the misaligned regions from the original video while preserving the correctly generated areas. This is achieved by frame-wise region decomposition using our Region-Preserving Segmentation (RPS) module. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
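The planning stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `EvalResult` record, the `plan_refinement` helper, and the rule that an object is preserved only when every question about it is answered "yes" are assumptions made for illustration, and the MLLM answers are hard-coded rather than produced by a model.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One fine-grained evaluation question and its (hypothetical) MLLM answer."""
    question: str
    obj: str       # object the question is about
    answer: bool   # True if the video was judged to match the prompt


def plan_refinement(results):
    """Split objects into those to preserve and those to refine.

    Assumed rule: an object counts as accurately generated only if
    every question about it was answered "yes".
    """
    answers_by_object = {}
    for r in results:
        answers_by_object.setdefault(r.obj, []).append(r.answer)
    preserve = [o for o, a in answers_by_object.items() if all(a)]
    refine = [o for o, a in answers_by_object.items() if not all(a)]
    return preserve, refine


# Hard-coded answers standing in for MLLM output on a toy prompt.
results = [
    EvalResult("Is there a cat?", "cat", True),
    EvalResult("Is the cat black?", "cat", True),
    EvalResult("Are there two dogs?", "dog", False),
]
preserve, refine = plan_refinement(results)
print("preserve:", preserve)
print("refine:", refine)
```

In this toy case the cat would be kept and the dog region would be handed to the localized refinement stage with its own prompt.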

Paper Structure

This paper contains 51 sections, 2 equations, 24 figures, 5 tables.

Figures (24)

  • Figure 1: VideoRepair is a model-agnostic, training-free, automatic refinement framework for improving alignment in text-to-video generation. Given an initial video from a text-to-video generation model, VideoRepair refines the video in two stages: (1) video refinement planning and (2) localized refinement. The black-and-white mask in the bottom left of each example indicates the localized refinement plan (black: regions to preserve; white: regions to refine).
  • Figure 2: Comparison of different refinement methods for alignment. (a) Prompt optimization (e.g., OPT2I) relies on LLM-based rewriting without visual or fine-grained feedback, making the search expensive (e.g., 30 iterations). (b) Recent work on localized feedback (e.g., SLD) provides visual guidance but relies on an external layout-guided generation module, often leading to unnatural refinements. (c) VideoRepair is a training-free, model-agnostic refinement framework for T2V alignment that provides fine-grained localized visual guidance and uses the original T2V model.
  • Figure 3: Illustration of VideoRepair. VideoRepair refines the generated video in two stages: (1) video refinement planning and (2) localized refinement. Given the prompt $p$, we first generate a fine-grained evaluation question set and ask the MLLM to provide answers. Next, we identify accurately generated objects $O^*$ and plan the refinement $p^r$ of other regions using MLLM/LLM. Based on $O^*$, we determine which regions to preserve or refine using the RPS module. Finally, we apply localized refinement with the original T2V model.
  • Figure 4: Videos generated with T2V-turbo and refinement frameworks (OPT2I / SLD / VideoRepair) on T2V-turbo. VideoRepair successfully addresses object and attribute misalignment issues (e.g., numeracy, spatial relationship, attribute blending) compared to T2V-turbo and other refinement methods. More visualization examples with T2V-turbo and VideoCrafter2 are provided in the appendix.
  • Figure 5: The iterative refinement of VideoRepair. Videos in each column represent the outputs of successive refinement iterations, where the output from the previous step serves as the input for the current step. The text at the bottom of each video row indicates the corresponding text prompt. More visualization examples are provided in the appendix.
  • ...and 19 more figures
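
The preserve/refine mask in the Figure 1 caption, together with the selective noise reinitialization mentioned in the TL;DR, can be sketched as follows. This is an illustrative assumption about how a mask might drive the reinitialization, not the paper's procedure: the `reinitialize_noise` helper is hypothetical, and a single 2x4 grid of numbers stands in for a per-frame noise latent. Mask value 0 corresponds to black (preserve) and 1 to white (refine).

```python
import random


def reinitialize_noise(noise, mask, rng):
    """Return a copy of `noise` with fresh Gaussian samples where mask == 1.

    Positions with mask == 0 (regions to preserve) keep their original
    noise value, so those regions can be regenerated unchanged.
    """
    return [
        [rng.gauss(0.0, 1.0) if m == 1 else n for n, m in zip(noise_row, mask_row)]
        for noise_row, mask_row in zip(noise, mask)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]
mask = [
    [0, 0, 1, 1],  # 0 = preserve (black), 1 = refine (white)
    [0, 1, 1, 0],
]
new_noise = reinitialize_noise(noise, mask, rng)

for old_row, new_row, mask_row in zip(noise, new_noise, mask):
    for old, new, m in zip(old_row, new_row, mask_row):
        status = "resampled" if m == 1 else "kept"
        print(f"{old:+.3f} -> {new:+.3f} ({status})")
```

Only the white-masked positions change, which is the property that would let a diffusion model keep correctly generated regions while regenerating the rest.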