Interpretable Convolutional SyncNet
Sungjoon Park, Jaesub Yun, Donggeon Lee, Minsik Park
TL;DR
This paper addresses the alignment of audio and video streams in real-world AV data. It proposes IC-SyncNet, a convolutional sync-net trained with the balanced BCE (BBCE) loss, which yields interpretable, probabilistic synchronization scores while preserving spatial information. The architecture uses lightweight CNN encoders for a 5-frame mouth region and a mel spectrogram, augmented with anti-aliasing, DropBlock, and a drop-and-tune BN strategy. The BBCE loss gives the sync score a direct probabilistic interpretation, supports multiple negatives, and avoids the complex sampling schemes that InfoNCE requires. The authors also introduce new sync-quality metrics (offset, probability at offset, and offscreen ratio), derived from the probabilistic outputs, to assess AV datasets and active speaker detection. They report state-of-the-art accuracy on LRS2 (96.5%) and LRS3 (93.8%) and performance competitive with InfoNCE, with practical benefits in interpretability and data handling that are relevant to AV synchronization and lip-sync generation tasks.
Abstract
Because videos in the wild can be out of sync for various reasons, a sync-net is used to bring the video back into sync for tasks that require synchronized videos. Previous state-of-the-art (SOTA) sync-nets use the InfoNCE loss, rely on the transformer architecture, or both. Unfortunately, the former makes the model's output difficult to interpret, and the latter scales poorly to large images; both limit the usefulness of sync-nets. In this work, we train a convolutional sync-net using the balanced BCE (BBCE) loss, a loss inspired by the binary cross entropy (BCE) and InfoNCE losses. In contrast to the InfoNCE loss, the BBCE loss does not require complicated sampling schemes. Our model handles larger images better, and its output can be given a probabilistic interpretation. This interpretation allows us to define metrics such as probability at offset and offscreen ratio to evaluate the sync quality of audio-visual (AV) speech datasets. Furthermore, our model achieves SOTA accuracy of $96.5\%$ on the LRS2 dataset and $93.8\%$ on the LRS3 dataset.
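To make the probabilistic reading concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the sync-net emits one logit per audio-video pair, that a sigmoid turns the logit into a sync probability, and that "balanced" means the positive term and the mean of the negative terms carry equal weight. The exact BBCE formula is not given in this section, so `balanced_bce`, `offset_and_probability`, and the weighting are hypothetical names and choices made for illustration only.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def balanced_bce(pos_logit: float, neg_logits: list[float]) -> float:
    """Assumed form of a balanced BCE over one positive and several negatives.

    -log(sigmoid(z)) equals softplus(-z), and -log(1 - sigmoid(z)) equals
    softplus(z). The negatives are averaged so that their total weight
    matches the single positive (an assumption, not the paper's definition).
    """
    pos_term = softplus(-pos_logit)
    neg_term = sum(softplus(z) for z in neg_logits) / len(neg_logits)
    return 0.5 * (pos_term + neg_term)


def offset_and_probability(logits_by_offset: dict[int, float]) -> tuple[int, float]:
    """Pick the most probable offset and report its sync probability.

    Assumes the predicted offset is the argmax over candidate offsets and
    that "probability at offset" is the sigmoid output at that offset.
    """
    probs = {k: sigmoid(z) for k, z in logits_by_offset.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


# Hypothetical logits for three candidate frame offsets.
best_offset, prob = offset_and_probability({-1: -2.0, 0: 3.0, 1: -1.0})
loss = balanced_bce(3.0, [-2.0, -1.0])
```

Because each score is an independent probability rather than a softmax over a sampled batch, a low maximum probability across all offsets can itself be read as a signal (for instance, that the speaker may be offscreen), which is the kind of use the metrics above rely on.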
