PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania, Metehan Cekic, Marcello Federico, Kyu J. Han

TL;DR

PEAVS tackles the need for a perceptually aligned, reference-free metric for audio-visual synchrony in real-world videos. It introduces an AVS benchmark with over 100 hours and 120K human annotations, and a cross-modal transformer-based PEAVS model trained in two stages (contrastive pretraining and CCC-optimized fine-tuning) to predict a 1–5 synchrony score. PEAVS achieves state-of-the-art alignment with human judgments (set-level $r=0.79$) and outperforms Fréchet-based baselines, with ablations confirming the importance of pretraining and cross-modal interactions. The work provides a valuable dataset and metric for evaluating AV generative models, facilitating more perceptually faithful synchronization in the wild.

Abstract

Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributable solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately, there is a lack of metrics that offer a quantitative and interpretable measure of audio-visual synchronization for videos "in the wild". To address this gap, we first created a large-scale, human-annotated dataset (100+ hrs) representing nine types of synchronization errors in audio-visual content and how humans perceive them. We then developed the PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization. We validate PEAVS using a newly generated dataset, achieving a Pearson correlation of 0.79 at the set level and 0.54 at the clip level when compared to human labels. In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics for audio-visual synchrony, confirming the efficacy of PEAVS in objectively modeling subjective perceptions of audio-visual synchronization for videos "in the wild".
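The two agreement statistics named above, Pearson correlation (used for validation) and the concordance correlation coefficient, CCC (used as the fine-tuning objective per the TL;DR), can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the set-level aggregation shown here (averaging clip scores within each set before correlating) is an assumption about how "set level" is computed.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two 1-D score arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def ccc(x, y):
    """Concordance correlation coefficient (Lin's CCC).

    Unlike Pearson's r, CCC also penalizes differences in mean and
    scale between predictions and labels.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return float(2.0 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2))


def set_level_means(scores, set_ids):
    """Average clip scores within each set (ASSUMED aggregation rule)."""
    scores = np.asarray(scores, dtype=float)
    set_ids = np.asarray(set_ids)
    return np.array([scores[set_ids == s].mean() for s in np.unique(set_ids)])


if __name__ == "__main__":
    # Toy, made-up scores on the 1-5 scale; not data from the paper.
    human = [4.5, 4.0, 2.0, 1.5, 3.0, 3.5]
    predicted = [4.2, 3.6, 2.5, 2.0, 3.1, 3.0]
    sets = [0, 0, 1, 1, 2, 2]  # hypothetical set membership per clip

    print("clip-level r:", pearson_r(human, predicted))
    print("clip-level CCC:", ccc(human, predicted))
    print("set-level r:", pearson_r(set_level_means(human, sets),
                                    set_level_means(predicted, sets)))
```

Averaging within a set smooths out per-clip annotation noise, which is one plausible reason a set-level correlation would come out higher than a clip-level one.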

Paper Structure (31 sections, 3 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: We compare absolute differences in annotation score across the distortion types. In this plot, the x-axis shows the distortion types being compared (e.g., '2 4' denotes distortion type 2 vs. type 4), and the y-axis shows the difference in scores across annotation tasks. For the ID-to-distortion-type mapping, see Table \ref{tab:type_values}.
  • Figure 2: Audio Shift
  • Figure 3: Audio Speed Up
  • Figure 4: Audio Speed Down
  • Figure 6: Framework Overview
  • ...and 12 more figures