We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

Minkyu Choi, S P Sharan, Harsh Goel, Sahil Shah, Sandeep Chinchali

TL;DR

This work tackles the challenge of semantically and temporally coherent text-to-video generation for long, complex prompts. It introduces NeuS-E, a zero-training refinement pipeline that uses neuro-symbolic feedback to identify weak propositions in a generated video, localize the most impactful frames, and perform targeted edits via keyframe adjustments and iterative regeneration guided by a temporal-logic (TL) specification. By constructing a video automaton from frame confidences and applying probabilistic model checking against TL specifications, NeuS-E achieves notable improvements in temporal fidelity across open- and closed-source T2V models. Human evaluations agree with the quantitative gains and show a substantial preference for the refined outputs. The approach demonstrates that structured neuro-symbolic feedback can enhance long-sequence video alignment without retraining, offering a practical pathway to more reliable T2V generation in real-world applications, though limitations remain in existing video generation backbones and in evaluation metrics such as VBench.

Abstract

Current text-to-video (T2V) generation models are increasingly popular due to their ability to produce coherent videos from textual prompts. However, these models often struggle to generate semantically and temporally consistent videos when dealing with longer, more complex prompts involving multiple objects or sequential events. Additionally, the high computational costs associated with training or fine-tuning make direct improvements impractical. To overcome these limitations, we introduce NeuS-E, a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. Our approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video. Extensive empirical evaluations on both open-source and proprietary T2V models demonstrate that NeuS-E significantly enhances temporal and logical alignment across diverse prompts by almost 40%.

Paper Structure

This paper contains 31 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The teaser demonstrates that the original video generation does not adhere to the prompt temporally. However, after editing the video using $\textit{NeuS-E}$, it aligns with the prompt while preserving well-generated portions. The color of the frame border corresponds to the color of the events and semantics in the text prompt. In the first example at the top, the original video omits the scene where a cyclist reaches a park and enjoys the scenery. However, after editing with $\textit{NeuS-E}$, the park becomes visible, and the cyclist is shown arriving. In the second example at the bottom, the sun remains visible in the original generation. However, $\textit{NeuS-E}$ successfully edits it, causing the sun to set and disappear behind the mountain.
  • Figure 2: Formal verification of a generated video with a video automaton. The video automaton expands as new frames are added. Once it is fully constructed, we verify it against the TL specification. The initial generation has a very low satisfaction probability, as the person neither stands nor walks away. To address this, we refine the video using $\textit{NeuS-E}$, producing a video that is temporally and logically aligned with the prompt and achieves a higher satisfaction probability.
  • Figure 3: Flowchart of the video editing pipeline. The $\textit{NeuS-E}$ pipeline consists of three steps. First, Figure 3(a) illustrates the original video generated by Pika, which is missing the events 'a sensor detects an error' and 'the system shuts down.' $\textit{NeuS-E}$ takes the key frame and identifies which events are missing. It then proposes an edit instruction for the key frame, as shown in Figure 3(b). Finally, a new video is generated from the modified key frame and instruction. The process repeats until the NeuS-V score exceeds a given threshold; the final video is shown in Figure 3(c).
  • Figure 4: Human Evaluation on Video Editing. Diverging bar chart of human preference labels on the dataset shows that our editing pipeline improves temporal fidelity.
  • Figure 5: Improvements from Iterative Rounds of Refinement. Distribution of NeuS-V score changes with a violin plot overlay. Green points indicate improvements for (a) Gen-3, (b) Pika-2.2, and (c) CogVideoX-5B.
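
The iterative loop described in the Figure 3 caption (edit, regenerate, re-score, stop once the score passes a threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `refine_until_threshold`, `score_fn`, `edit_fn`, the default `threshold`, and `max_rounds` are all hypothetical names and values, and keeping only the best-scoring candidate between rounds is an assumption made here for simplicity.

```python
from typing import Any, Callable, List, Tuple


def refine_until_threshold(
    video: Any,
    prompt: str,
    score_fn: Callable[[Any, str], float],  # hypothetical scorer, e.g. a NeuS-V-style score
    edit_fn: Callable[[Any, str], Any],     # hypothetical edit-and-regenerate step
    threshold: float = 0.8,                 # assumed stopping threshold
    max_rounds: int = 5,                    # assumed cap on refinement rounds
) -> Tuple[Any, float, List[float]]:
    """Repeat edit -> regenerate -> score until the score reaches the threshold."""
    best_video = video
    best_score = score_fn(video, prompt)
    history = [best_score]  # score of the original video, then of each candidate

    for _ in range(max_rounds):
        if best_score >= threshold:
            break
        candidate = edit_fn(best_video, prompt)
        candidate_score = score_fn(candidate, prompt)
        history.append(candidate_score)
        # Assumption: a candidate replaces the current video only if it scores higher.
        if candidate_score > best_score:
            best_video, best_score = candidate, candidate_score

    return best_video, best_score, history
```

In practice `score_fn` would wrap the verification step and `edit_fn` would wrap the key-frame edit plus video regeneration; both are left abstract here because the caption does not specify their interfaces.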

Theorems & Definitions (2)

  • Definition 1: Video Automaton Construction
  • Definition 2: Satisfaction Probability
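
To give a feel for what a satisfaction probability over frame confidences might look like, here is a toy sketch for one very simple specification: "event A occurs, and event B occurs at a strictly later frame." It is not the paper's Definition 1 or 2, which are not reproduced in this summary. It assumes that per-frame detections are independent, and the function name `sequential_satisfaction` and its inputs are hypothetical. The computation tracks a three-state chain (waiting for A, waiting for B, accepted) frame by frame.

```python
from typing import Sequence


def sequential_satisfaction(p_a: Sequence[float], p_b: Sequence[float]) -> float:
    """Probability that A holds at some frame and B holds at a strictly later frame.

    p_a[t] and p_b[t] are assumed per-frame confidences that A and B hold at
    frame t, treated as independent Bernoulli events (a simplifying assumption).
    """
    if len(p_a) != len(p_b):
        raise ValueError("p_a and p_b must cover the same frames")

    waiting_a = 1.0  # probability mass that has not yet seen A
    waiting_b = 0.0  # mass that has seen A and is waiting for B
    accepted = 0.0   # mass that has seen A followed later by B

    for pa, pb in zip(p_a, p_b):
        # Mass already waiting for B is accepted if B holds in this frame.
        accepted += waiting_b * pb
        # Mass seeing A in this frame starts waiting for B from the next frame.
        next_waiting_b = waiting_b * (1.0 - pb) + waiting_a * pa
        waiting_a *= 1.0 - pa
        waiting_b = next_waiting_b

    return accepted


# Correct order (A early, B late) versus reversed order (B early, A late).
in_order = sequential_satisfaction([1.0, 0.0], [0.0, 1.0])
reversed_order = sequential_satisfaction([0.0, 1.0], [1.0, 0.0])
```

Under these assumptions, a video in which the events appear in the wrong order receives a satisfaction probability of zero even though both events are individually detected, which is the kind of temporal failure a frame-level similarity score would not expose.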