Q-Save: Towards Scoring and Attribution for Generated Video Evaluation
Xiele Wu, Zicheng Zhang, Mingtao Chen, Yixian Liu, Yiming Liu, Shushi Wang, Zhichao Hu, Yuhong Liu, Guangtao Zhai, Xiaohong Liu
TL;DR
Q-Save addresses the need for unified, explainable evaluation of AI-generated videos in two ways: it introduces a large-scale dataset (~10k videos) with MOS and fine-grained attributions across visual quality, dynamic quality, and text-video alignment, and it proposes a single model that jointly scores and explains across these dimensions. The framework combines SlowFast-based video processing, Chain-of-Thought prompts, and a three-stage training pipeline (SFT → GRPO → SFT) with a dual-supervision loss, achieving state-of-the-art instance- and model-level correlations (SRCC, PLCC) while delivering human-aligned justifications. Cross-dataset validation demonstrates strong generalization to out-of-domain benchmarks, underscoring the practical utility of unified, interpretable video evaluation. The work lays a foundation for explainable multimodal video evaluation and provides a pathway for community adoption through dataset and code release. Overall, Q-Save advances trustworthy AI in generative video by enabling precise quality assessment and transparent, rationale-based explanations.
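For readers unfamiliar with the two correlation metrics, the following is a minimal sketch of how SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation) between predicted scores and MOS might be computed. The function names and the toy data are hypothetical, and reading "model-level" as "average the scores per video generator, then correlate" is an assumption, not something stated here.

```python
from collections import defaultdict
from math import sqrt


def plcc(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return plcc(average_ranks(x), average_ranks(y))


def model_level(generators, scores):
    """Mean score per generator, in sorted generator order (assumed protocol)."""
    groups = defaultdict(list)
    for g, s in zip(generators, scores):
        groups[g].append(s)
    return [sum(groups[g]) / len(groups[g]) for g in sorted(groups)]


# Toy data, invented purely for illustration.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.2, 1.9, 3.5, 3.4, 4.8]
gens = ["A", "A", "B", "B", "C"]

instance_srcc = srcc(pred, mos)
instance_plcc = plcc(pred, mos)
model_srcc = srcc(model_level(gens, pred), model_level(gens, mos))
```

Instance-level correlation compares scores video by video, whereas the model-level variant sketched here first collapses each generator to a single mean score, so it measures whether generators are ranked correctly rather than individual videos.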
Abstract
We present Q-Save, a new benchmark dataset and model for holistic and explainable evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels along three core dimensions: visual quality, dynamic quality, and text-video alignment. These multi-aspect annotations enable both accurate quality assessment and interpretable reasoning behind the scores. To leverage this data, we propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation. The model adopts the SlowFast framework to distinguish between slow frames and fast frames: slow frames are processed at high resolution and fast frames at low resolution, balancing evaluation accuracy and computational efficiency. For training, we use data formatted in Chain-of-Thought (CoT) style and employ a multi-stage strategy: we first conduct Supervised Fine-Tuning (SFT), then further enhance the model with Group Relative Policy Optimization (GRPO), and finally perform SFT again to improve model stability. Experimental results demonstrate that our model achieves state-of-the-art performance in video quality prediction while also providing human-aligned, interpretable justifications. Our dataset and model establish a strong foundation for explainable evaluation in generative video research, contributing to the development of multimodal generation and trustworthy AI. Code and dataset will be released upon publication.
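The slow/fast frame split described above can be illustrated with a small sampling sketch. This is not the authors' implementation: the function names, the strides, and the two resolutions (448 and 224 pixels) are hypothetical values chosen only to show the idea of a few high-resolution frames interleaved with many low-resolution ones.

```python
def slowfast_plan(num_frames, slow_stride=8, fast_stride=2,
                  slow_res=448, fast_res=224):
    """Return a list of (frame_index, resolution) pairs.

    Frames on the sparse "slow" grid get the high resolution; the remaining
    frames on the dense "fast" grid get the low resolution. All default
    values are illustrative assumptions.
    """
    if num_frames <= 0 or slow_stride <= 0 or fast_stride <= 0:
        raise ValueError("num_frames and strides must be positive")
    slow = set(range(0, num_frames, slow_stride))
    fast = set(range(0, num_frames, fast_stride))
    return [(i, slow_res if i in slow else fast_res)
            for i in sorted(slow | fast)]


def pixel_budget(plan):
    """Total pixel count of the plan, a rough proxy for compute cost."""
    return sum(res * res for _, res in plan)


plan = slowfast_plan(16)

# Cost relative to processing every sampled frame at the high resolution.
all_high = [(i, 448) for i, _ in plan]
relative_cost = pixel_budget(plan) / pixel_budget(all_high)
```

Under these assumed settings, only the frames at indices 0 and 8 are kept at high resolution, so fine spatial detail is sampled sparsely while motion is still observed densely at lower cost.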
