Synchformer: Efficient Synchronization from Sparse Cues
Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
TL;DR
This paper tackles audio-visual synchronization in open-world videos with sparse cues by proposing Synchformer, a transformer-based model that decouples feature extraction from the synchronization task through a two-stage training pipeline. It introduces Segment AVCLIP pre-training to learn discriminative segment-level representations, and a lightweight synchronization head that can be trained on top of frozen feature extractors. This makes larger models and scalable training feasible, including training on AudioSet. The work also contributes interpretability through evidence attribution and a new synchronizability prediction task, and reports state-of-the-art results in both dense and sparse settings, with further gains from million-scale data.
Abstract
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
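To make the idea of segment-level contrastive pre-training concrete, the following is a minimal sketch of a generic symmetric InfoNCE objective over paired audio and visual segment features. It is an illustration only: the function names, the temperature value, and the use of in-batch negatives are assumptions, and the exact loss, negative sampling, and feature aggregation used by the authors are not specified in this summary.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy where queries[i] should match keys[i].

    All other keys in the list act as negatives (an assumption of this sketch).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.1):
    """Symmetric audio-to-visual and visual-to-audio contrastive loss.

    audio_feats[i] and visual_feats[i] are hypothetical features of the
    same temporal segment; every other pairing is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_feats]
    v = [l2_normalize(x) for x in visual_feats]
    return 0.5 * (info_nce(a, v, temperature) + info_nce(v, a, temperature))

# Toy usage: two segments with 2-D features.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual_aligned = [[1.0, 0.0], [0.0, 1.0]]
visual_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = segment_contrastive_loss(audio, visual_aligned)
loss_swapped = segment_contrastive_loss(audio, visual_swapped)
```

In this toy example the loss is lower when each audio segment is paired with its own visual segment than when the pairs are swapped. Under the two-stage scheme described above, feature extractors trained with an objective of this kind would then be frozen while a separate synchronization head is trained on their outputs.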
