VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

Daeun Lee; Jaehong Yoon; Jaemin Cho; Mohit Bansal

TL;DR

VideoRepair addresses text-video misalignment in text-to-video (T2V) generation with a two-stage, training-free refinement framework that requires no additional generators. It first performs video refinement planning, using LLM- and MLLM-based evaluation questions to identify misaligned content. It then conducts localized refinement, using a Region-Preserving Segmentation (RPS) module to preserve correctly generated regions while regenerating misaligned areas with region-specific prompts and selective noise reinitialization. The approach yields substantial gains in text-video alignment on EvalCrafter and T2V-CompBench while maintaining visual quality, and it supports iterative refinement for progressive improvement. Overall, VideoRepair offers a practical, model-agnostic way to improve prompt adherence in diffusion-based T2V systems, toward more reliable compositional video generation in real-world applications.
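One plausible reading of "selective noise reinitialization" is a masked blend of noise tensors: keep the original initial noise where the video was judged correct, and draw fresh noise elsewhere. The sketch below illustrates only that idea. The function name, tensor shapes, and mask convention are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def reinitialize_noise(initial_noise, preserve_mask, seed=0):
    """Keep the original noise where preserve_mask is 1; resample it elsewhere.

    initial_noise: array of shape (frames, channels, height, width)
    preserve_mask: array of shape (frames, height, width); 1 = keep, 0 = regenerate
    (Both shapes are assumed for this sketch.)
    """
    rng = np.random.default_rng(seed)
    fresh_noise = rng.standard_normal(initial_noise.shape)
    # Add a channel axis so the mask broadcasts over all channels.
    mask = preserve_mask[:, None, :, :].astype(initial_noise.dtype)
    return mask * initial_noise + (1.0 - mask) * fresh_noise

# Toy usage: preserve the left half of every frame, regenerate the right half.
noise = np.random.default_rng(42).standard_normal((2, 4, 8, 8))
mask = np.zeros((2, 8, 8))
mask[:, :, :4] = 1
new_noise = reinitialize_noise(noise, mask, seed=1)
```

In this toy setup the left half of `new_noise` matches the original noise exactly, so a deterministic sampler would start those regions from the same state, while the right half starts from a new sample.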

Abstract

Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that have misalignments with text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of two stages: In (1) video refinement planning, we first detect misalignments by generating fine-grained evaluation questions and answering them using an MLLM. Based on video evaluation outputs, we identify accurately generated objects and construct localized prompts to precisely refine misaligned regions. In (2) localized refinement, we enhance video alignment by 'repairing' the misaligned regions from the original video while preserving the correctly generated areas. This is achieved by frame-wise region decomposition using our Region-Preserving Segmentation (RPS) module. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
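The planning stage described above can be pictured as a small bookkeeping step: given per-object evaluation answers, split the objects into those to preserve and those to repair, then build a localized prompt from the latter. The following is a minimal sketch under assumed data structures. `QuestionResult`, `plan_refinement`, and the example questions are hypothetical, and the MLLM answers are hard-coded rather than produced by a model.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    obj: str       # object the evaluation question refers to
    question: str  # fine-grained evaluation question
    passed: bool   # True if the (hypothetical) MLLM answer agrees with the prompt

def plan_refinement(results):
    """Split objects into those to preserve and those to regenerate."""
    objects = []
    for r in results:
        if r.obj not in objects:  # keep first-seen order
            objects.append(r.obj)
    failed = {r.obj for r in results if not r.passed}
    keep = [o for o in objects if o not in failed]
    repair = [o for o in objects if o in failed]
    return keep, repair

# Hard-coded answers stand in for real MLLM outputs.
results = [
    QuestionResult("dog", "Is there a dog?", True),
    QuestionResult("dog", "Is the dog brown?", True),
    QuestionResult("cat", "Is there a cat?", False),
]
keep, repair = plan_refinement(results)
localized_prompt = ", ".join(repair)  # prompt text for the regions to regenerate
```

Here an object is preserved only if every question about it passes. That is a conservative choice made for the sketch, not a rule stated in the abstract.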
