
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki

Abstract

This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to judge. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, whereas in nine of the 10 aspects the accuracy of the automatic evaluation systems falls short of human performance, revealing weaknesses in text-to-long-video evaluation.
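The pairwise comparison-based framework described above reduces meta-evaluation to a simple decision: for each prompt and high-quality/low-quality pair, an evaluation system is counted as correct when it scores the original video above its degraded counterpart, and its accuracy is the fraction of pairs it gets right. Below is a minimal sketch of that computation; the `VideoPair` container and the `score_fn` interface are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of pairwise-comparison meta-evaluation accuracy.
# VideoPair and score_fn are assumed/illustrative interfaces, not from the paper.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class VideoPair:
    prompt: str
    original: str   # path to the original (high-quality) long video
    degraded: str   # path to its synthetically degraded counterpart


def pairwise_accuracy(pairs: Sequence[VideoPair],
                      score_fn: Callable[[str, str], float]) -> float:
    """Return the fraction of pairs in which the system scores the original higher."""
    correct = sum(
        score_fn(p.prompt, p.original) > score_fn(p.prompt, p.degraded)
        for p in pairs
    )
    return correct / len(pairs)
```

The human accuracies quoted above (84.7%-96.8%) are of the same pairwise form: the fraction of pairs for which annotators select the original, non-degraded video.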

Paper Structure

This paper contains 71 sections, 7 equations, 9 figures, 10 tables, and 3 algorithms.

Figures (9)

  • Figure 1: Overview of the proposed SLVMEval benchmark. We construct human-validated pairs of original and aspect-specifically degraded long videos, and we test various automatic evaluation systems. Human evaluators reliably pick the better video; however, all current automatic evaluation systems lag behind human performance on most aspects, revealing critical weaknesses in T2LV evaluation.
  • Figure 2: Viewpoints and aspect-specific degrading operations in the proposed SLVMEval benchmark. We organize the benchmark into two groups, i.e., video quality and video-text consistency, and define 10 aspects. For each aspect, we construct paired videos by applying a controlled synthetic degradation to the original long video while keeping all other factors unchanged. The right panels show example pairs. These controlled pairs enable precise meta-evaluation of whether an automatic evaluation system can reliably identify the high-quality video under each viewpoint. Additional example pairs are provided in the supplementary material.
  • Figure 3: Relationship between video duration and accuracy. We sorted the dataset by video duration, divided it into four bins (intervals), and computed the accuracy within each bin. The x-axis and y-axis represent the average video duration in each bin and the corresponding accuracy, respectively. For each aspect and automatic evaluation system (excluding the human evaluators), we computed the Spearman rank correlation coefficient $\rho_{\mathrm{S}}$ between video duration and accuracy: we sorted the samples by video duration, divided them into 50 bins, computed the accuracy within each bin, and then measured Spearman's $\rho_{\mathrm{S}}$ over these 50 accuracy values (a code sketch of this binning procedure follows this list). The per-aspect Spearman's $\rho_{\mathrm{S}}$ values and associated $p$-values for each automatic evaluation system are reported in the supplementary material.
  • Figure 4: Relationship between accuracy values before and after filtering on the degraded SLVMEval data. For each aspect, we plot the accuracy of each evaluation system before versus after filtering and compute the Pearson correlation coefficient $\rho_{\mathrm{P}}$ from these points. Each marker corresponds to one evaluation system. The horizontal axis and vertical axis show the accuracy on the unfiltered data and the filtered test set, respectively. The solid line is the fitted linear regression line visualizing the correlation.
  • Figure 5: Example of the annotation interface used for the Object Integrity aspect. The interface presents the prompt and two videos, collects the worker's choice of the higher quality video, and then asks additional questions to verify whether the synthetic degradation for the target aspect has been applied correctly.
  • ...and 4 more figures
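The duration analysis summarized in the Figure 3 caption can be reproduced with a simple binning procedure. The snippet below is a sketch under the assumption that, for a given aspect and evaluation system, each sample has been reduced to its video duration and a binary correct/incorrect outcome; variable and function names are illustrative.

```python
# Sketch of the Figure 3-style duration-binning analysis.
# Assumed inputs: per-sample video durations (seconds) and binary correctness flags.
import numpy as np
from scipy.stats import spearmanr


def duration_vs_accuracy(durations, correct, n_bins=50):
    """Sort samples by duration, split them into n_bins bins, and return
    Spearman's rho (and p-value) between mean bin duration and per-bin accuracy."""
    order = np.argsort(durations)
    dur_bins = np.array_split(np.asarray(durations, dtype=float)[order], n_bins)
    cor_bins = np.array_split(np.asarray(correct, dtype=float)[order], n_bins)
    mean_duration = [b.mean() for b in dur_bins]
    bin_accuracy = [b.mean() for b in cor_bins]
    rho_s, p_value = spearmanr(mean_duration, bin_accuracy)
    return rho_s, p_value
```

The Figure 4 analysis is analogous but correlates each system's accuracy on the unfiltered data with its accuracy on the filtered test set using Pearson's correlation (e.g., scipy.stats.pearsonr).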