Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

Tianle Zhang; Langtian Ma; Yuchen Yan; Yuchen Zhang; Kai Wang; Yue Yang; Ziyao Guo; Wenqi Shao; Yang You; Yu Qiao; Ping Luo; Kaipeng Zhang

TL;DR

This paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized human evaluation protocol for T2V models that includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module.

Abstract

Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols face reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code. This will help communities establish more sophisticated human assessment protocols.

Paper Structure (30 sections, 6 equations, 9 figures, 15 tables, 1 algorithm)

Introduction
Related work
Human evaluation in video generation
Our protocol for text-to-video models
Evaluation metrics
Evaluation method
Evaluators
Dynamic evaluation module
Human evaluation of existing models
Settings
Evaluation results
Module validation
Limitations
Conclusion
Author contributions
...and 15 more sections

Figures (9)

Figure 1: (a) An illustration of our human evaluation protocol. (b) The annotation interface, wherein annotators choose the superior video based on the provided evaluation metrics. (c) Instructions and examples used to guide the "Video Quality" evaluation.
Figure 2: Scores and rankings of models across various dimensions for Pre-training LRAs, AMT Annotators, and Post-training LRAs. Post-training LRAs (Dyn) refers to the annotation results of Post-training LRAs using the dynamic evaluation component.
Figure 3: The left panel shows the number of annotations required by different protocols. The right panel shows model score estimates across different metrics. Each boxplot illustrates the median, interquartile range, and 95% confidence intervals of the estimates.
Figure 4: Instructions and examples used to guide the "Temporal Quality" evaluation.
Figure 5: Instructions and examples used to guide the "Motion Quality" evaluation.
...and 4 more figures
