Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset

Iya Chivileva; Philip Lynch; Tomas E. Ward; Alan F. Smeaton

TL;DR

The paper tackles the challenge of reliably evaluating text-to-video (T2V) outputs by analyzing the limitations of common metrics and comparing them to human judgments. It introduces an open dataset of 1,005 videos generated by 5 recent T2V models, paired with extensive human quality assessments of alignment and perceptual realism. A new Ensemble Video Quality Metric is proposed, combining a text-similarity component (BLIP-2 captions with a BERT/Cosine fusion at a 0.75:0.25 ratio) and a learned naturalness score from an XGBoost classifier trained against human scores. Findings indicate partial correspondence between automatic metrics and human judgments, but no single metric fully captures both naturalness and semantic alignment, underscoring the continued value of human evaluation alongside automatic proxies.
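The 0.75:0.25 fusion mentioned above can be read as a weighted average of two similarity scores. The following is a minimal sketch under that reading, not the authors' implementation: the function name `fuse_text_similarity` and its parameters are hypothetical, and assigning the 0.75 weight to the BERT score (following the order in "BERT/Cosine") is an assumption. Both inputs are assumed to be precomputed similarities between the prompt and the generated captions.

```python
def fuse_text_similarity(bert_sim: float, cosine_sim: float,
                         bert_weight: float = 0.75) -> float:
    """Weighted average of two text-similarity scores.

    Hypothetical helper: assumes the 0.75 weight applies to the
    BERT-based score and the remaining 0.25 to the cosine score.
    """
    if not 0.0 <= bert_weight <= 1.0:
        raise ValueError("bert_weight must lie in [0, 1]")
    # The two weights sum to 1, so the result stays within the
    # range spanned by the two input scores.
    return bert_weight * bert_sim + (1.0 - bert_weight) * cosine_sim


# Example: a high BERT similarity and a lower cosine similarity.
score = fuse_text_similarity(bert_sim=0.8, cosine_sim=0.4)
print(round(score, 3))
```

The sketch covers only the text-similarity side; the learned naturalness score is a separate component and is not modeled here.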

Abstract

Evaluating the quality of videos generated from text-to-video (T2V) models is important if they are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations. The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied. We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared. The contribution is an assessment of commonly used quality metrics, and a comparison of their performances and the performance of human evaluations on an open dataset of T2V videos. Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output.

Paper Structure (10 sections, 1 equation, 10 figures, 2 tables)

Introduction
Related Work
Text-to-Video Models
Evaluation Metrics
Image Naturalness
An Ensemble Video Quality Metric
Evaluating Image Naturalness
Evaluating Text Similarity
Evaluation
Conclusions

Figures (10)

Figure 1: Example limitations of existing T2V quality metrics.
Figure 2: Image naturalness assessment with NIQE (N) and BRISQUE (B) scores.
Figure 3: T2V-CL metric ensemble
Figure 4: Frames from a generated video with the prompt "A golden retriever eating ice cream on a beautiful tropical beach at sunset". Note that 2 of the frames are missing the dog.
Figure 5: Samples from our generated videos -- rows show frames generated by Text2Video-Zero, Text-to-Video Synthesis, Tune-a-Video, Aphantasia and Video Fusion respectively while columns are frames from the same text prompts.
...and 5 more figures
