VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

Yiwen Song, Tomas Pfister, Yale Song

Abstract

Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.

Paper Structure (52 sections, 7 equations, 14 figures, 7 tables)

This paper contains 52 sections, 7 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Videos generated by VQQA, showing improvements in categories such as numeracy, interaction, dynamic attributes, and spatial relationship.
  • Figure 2: Qualitative example of the VQQA iterative refinement process. Using low-scoring QA pairs to isolate visual flaws, VQQA constructs an optimized prompt that successfully mitigates the localized artifacts in the next generation.
  • Figure 3: The VQQA framework: Given generation conditions $C$ and a prompt $p_t$, the model $M$ produces a video $v_t$. The multi-agent framework uses a Question Generation (QG) agent to formulate visual queries $Q$ and a Question Answering (QA) agent to evaluate the video and produce a score report. These outputs inform the Prompt Refinement (PR) agent, which uses semantic gradients to update the prompt for the next iteration. Finally, a Global VLM Rater assesses the candidate set of generated videos against the original conditions to select the optimal video $v^*$.
  • Figure 4: Convergence analysis of VQQA on T2V-CompBench. Evaluations are performed for CogVideoX-5B generations using Gemini-3-Pro. (a) Correlations between the maximum global score $S^*_t$ and the T2V-CompBench metric across optimization steps. The blue shaded region indicates $\pm 1$ standard deviation of $S^*_t$ across the 1400 evaluated samples. The detailed performance breakdown for each individual category is provided in Appendix \ref{subsec:long_horizon_runs}. (b) Sensitivity analysis of average iterations to converge given the patience window $k$ and saturation threshold $\epsilon$ (\ref{saturation_eq}).
  • Figure 5: Ablation Study on Global Selection Mechanism. The number of prompt optimization rounds for VQQA is fixed at $N = 4$.
  • ...and 9 more figures
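
The closed loop described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `vqqa_loop`, the five callables, the score scale, and the `low` threshold are all hypothetical stand-ins for the model $M$, the QG, QA, and PR agents, and the Global VLM Rater. The default of four rounds mirrors the $N = 4$ setting mentioned in the Figure 5 caption.

```python
from typing import Callable, List, Tuple

Report = List[Tuple[str, float]]  # (question, score) pairs; score scale is assumed

def vqqa_loop(
    prompt: str,
    generate: Callable[[str], str],             # stand-in for M: prompt -> video handle
    ask: Callable[[str], List[str]],            # stand-in for the QG agent
    answer: Callable[[str, List[str]], Report], # stand-in for the QA agent
    refine: Callable[[str, Report], str],       # stand-in for the PR agent
    rate: Callable[[str], float],               # stand-in for the Global VLM Rater
    steps: int = 4,                             # number of refinement rounds
    low: float = 0.5,                           # assumed cutoff for "low-scoring" QA pairs
) -> Tuple[str, float]:
    """Return the best (video, global score) found over the refinement rounds."""
    candidates: List[Tuple[str, float]] = []
    current = prompt
    for _ in range(steps):
        video = generate(current)
        candidates.append((video, rate(video)))
        # Questions are derived from the original prompt so the target intent stays fixed.
        report = answer(video, ask(prompt))
        flaws = [(q, s) for q, s in report if s < low]
        if not flaws:
            break  # nothing left to fix according to the QA agent
        # The low-scoring QA pairs act as the feedback signal for the next prompt.
        current = refine(current, flaws)
    # Global selection: keep the candidate with the highest rater score.
    return max(candidates, key=lambda c: c[1])
```

Passing the agents as plain callables keeps the sketch black-box with respect to the generator: nothing here touches model internals, only prompts, videos, and scores. The patience-window and saturation-threshold stopping rule referenced in the Figure 4 caption is not modeled, since its equation is not reproduced in this summary.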