Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset
Iya Chivileva, Philip Lynch, Tomas E. Ward, Alan F. Smeaton
TL;DR
The paper tackles the challenge of reliably evaluating text-to-video (T2V) outputs by analyzing the limitations of common metrics and comparing them to human judgments. It introduces an open dataset of 1,005 videos generated by 5 recent T2V models, paired with extensive human quality assessments of alignment and perceptual realism. A new Ensemble Video Quality Metric is proposed, combining a text-similarity component (BLIP-2 captions with a BERT/Cosine fusion at a 0.75:0.25 ratio) and a learned naturalness score from an XGBoost classifier trained against human scores. Findings indicate partial correspondence between automatic metrics and human judgments, but no single metric fully captures naturalness and semantic alignment, underscoring the continued value of human evaluation alongside automatic proxies.
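The 0.75:0.25 fusion in the text-similarity component can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the BERT-based similarity is assumed to be computed elsewhere and passed in as a number in [0, 1], the cosine term is approximated here with simple bag-of-words counts, and assigning the 0.75 weight to the BERT term is an assumption read from the order "BERT/Cosine".

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Illustrative stand-in for the cosine term; the paper's exact
    vectorisation is not given in this summary.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def fuse_text_similarity(bert_sim: float, cosine_sim: float,
                         w_bert: float = 0.75) -> float:
    """Weighted sum of two similarity scores.

    Assumption: the 0.75 weight applies to the BERT-based score and the
    remaining 0.25 to the cosine score.
    """
    return w_bert * bert_sim + (1.0 - w_bert) * cosine_sim


if __name__ == "__main__":
    prompt = "a dog runs on the beach"
    caption = "a dog runs along the beach"  # e.g. a generated video caption
    bert_sim = 0.8  # placeholder value; would come from a BERT-based scorer
    score = fuse_text_similarity(bert_sim, cosine_similarity(prompt, caption))
    print(f"fused text similarity: {score:.3f}")
```

In this sketch the prompt is compared against a caption generated from the video, and the two similarity scores are blended into a single alignment value; the naturalness score from the classifier would be a separate input to the ensemble.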
Abstract
Evaluating the quality of videos generated from text-to-video (T2V) models is important if they are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations. The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied. We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared. The contribution is an assessment of commonly used quality metrics, and a comparison of their performances and the performance of human evaluations on an open dataset of T2V videos. Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output.
