Synchformer: Efficient Synchronization from Sparse Cues

Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

TL;DR

This paper tackles audio-visual synchronization in open-world videos with sparse cues by proposing Synchformer, a transformer-based model that decouples feature extraction from the synchronization task through a two-stage training pipeline. It introduces Segment AVCLIP pre-training to learn discriminative segment-level representations, and a lightweight synchronization head that can be trained on top of frozen feature extractors, which makes larger models and larger-scale training practical, including on AudioSet. The work also contributes interpretability via evidence attribution and a novel synchronizability prediction task, and reports state-of-the-art results in both dense and sparse settings, with further gains from million-scale training data.

Abstract

Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.

Paper Structure (16 sections, 4 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Synchformer ($\mathcal{M}$). The audio and visual streams ($A, V$) are split into $S$ segments of equal duration. The segment-level inputs are then fed into their respective feature extractors ($\mathcal{F}_a, \mathcal{F}_v$). The resulting features are aggregated along the spatial (or frequency) dimension by $\mathcal{G}_a, \mathcal{G}_v$, and concatenated into a single sequence with auxiliary tokens (CLS and SEP). The sequence is fed into the synchronization module $\mathcal{T}$, which predicts the temporal offset $\hat{\Delta}$. The dashed lines show the training of the model.
  • Figure 2: Segment AVCLIP Pre-training. The audio ($A$) and visual ($V$) streams are split into $S$ segments, which are fed into their respective feature extractors ($\mathcal{F}_a, \mathcal{F}_v$). The outputs of the feature extractors ($\overline{a}_s, \overline{v}_s$) are aggregated along time (omitted for clarity) to obtain audio and visual features ($\tilde{a}_s, \tilde{v}_s$). The features from corresponding segments in a video are pulled together ($\uparrow$), while the features from other segments are pushed apart ($\downarrow$).
  • Figure 3: Visualization of evidence attribution. The moment of 'hitting the ball' and the ground truth 'offset' are highlighted in both streams.
  • Figure 4: Predicting synchronizability with Synchformer. Left: ROC curve. Right: the synchronization performance on videos that were ranked by the synchronizability model. The results are reported on VGGSound-Sparse.
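
The segment-level contrastive objective described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a symmetric InfoNCE-style loss over a batch of already-aggregated segment features, a temperature of 0.07, and L2-normalized features, none of which is stated in the captions above. The function names (`split_into_segments`, `segment_contrastive_loss`) are hypothetical.

```python
import numpy as np


def split_into_segments(x, num_segments):
    """Split a (T, ...) array into equal-length chunks along time.

    Trailing frames that do not fill a whole segment are dropped
    (an assumption made for this sketch).
    """
    seg_len = x.shape[0] // num_segments
    trimmed = x[: seg_len * num_segments]
    return trimmed.reshape(num_segments, seg_len, *x.shape[1:])


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over N segment features of shape (N, D).

    Row i of audio_feats and row i of visual_feats come from the same
    segment (the positive pair); every other row acts as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    a = l2_normalize(audio_feats)
    v = l2_normalize(visual_feats)
    logits = a @ v.T / temperature  # (N, N) pairwise similarities
    idx = np.arange(len(a))
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> visual
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # visual -> audio
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(8, 16))
    audio = visual + 0.01 * rng.normal(size=(8, 16))  # near-identical pairs
    print("aligned loss:   ", segment_contrastive_loss(audio, visual))
    print("misaligned loss:", segment_contrastive_loss(audio, np.roll(visual, 1, axis=0)))
```

When corresponding rows are aligned the loss is low, and shifting one modality by a segment raises it. That is the behaviour the caption describes as pulling corresponding segments together and pushing other segments apart.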