Interpretable Convolutional SyncNet

Sungjoon Park, Jaesub Yun, Donggeon Lee, Minsik Park

TL;DR

This paper addresses audio-video alignment in real-world AV data by proposing IC-SyncNet, a convolutional sync-net trained with the balanced BCE (BBCE) loss so that its output can be read as a synchronization probability while spatial information is preserved. The architecture uses lightweight CNN encoders for a 5-frame mouth region and a mel spectrogram, augmented with anti-aliasing, DropBlock, and a drop-and-tune BN strategy. The BBCE loss gives the sync score a direct probabilistic interpretation, supports multiple negatives, and avoids the complex sampling schemes required by InfoNCE. From the probabilistic outputs, the authors derive new sync-quality metrics (offset, probability at offset, and offscreen ratio) for assessing AV datasets and active speaker detection. They report state-of-the-art accuracy on LRS2 (96.5%) and LRS3 (93.8%) and performance competitive with InfoNCE, with practical benefits in interpretability and data handling that are relevant to AV synchronization and lip-sync generation tasks.

Abstract

Because videos in the wild can be out of sync for various reasons, a sync-net is used to bring the video back into sync for tasks that require synchronized videos. Previous state-of-the-art (SOTA) sync-nets use InfoNCE loss, rely on the transformer architecture, or both. Unfortunately, the former makes the model's output difficult to interpret, and the latter is unfriendly with large images, thus limiting the usefulness of sync-nets. In this work, we train a convolutional sync-net using the balanced BCE loss (BBCE), a loss inspired by the binary cross entropy (BCE) and the InfoNCE losses. In contrast to the InfoNCE loss, the BBCE loss does not require complicated sampling schemes. Our model can better handle larger images, and its output can be given a probabilistic interpretation. The probabilistic interpretation allows us to define metrics such as probability at offset and offscreen ratio to evaluate the sync quality of audio-visual (AV) speech datasets. Furthermore, our model achieves SOTA accuracy of $96.5\%$ on the LRS2 dataset and $93.8\%$ on the LRS3 dataset.
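
The abstract describes the BBCE loss only as a BCE-style loss inspired by InfoNCE, and this section does not give its formula. The sketch below is therefore one plausible reading, not the authors' definition: it assumes one positive (in-sync) pair and several negative (out-of-sync) pairs per sample, and it balances them by averaging the negative terms so that the positive and the negatives carry equal weight. The names `balanced_bce`, `sync_probability`, `pos_logit`, and `neg_logits` are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def balanced_bce(pos_logit, neg_logits):
    """Assumed balanced BCE for one positive and N negative AV pairs.

    The positive term and the mean negative term are weighted equally,
    so the loss does not depend on how many negatives are supplied.
    """
    neg_logits = np.asarray(neg_logits, dtype=float)
    pos_term = -log_sigmoid(pos_logit)           # -log p(sync) for the positive pair
    neg_term = -log_sigmoid(-neg_logits).mean()  # mean of -log(1 - p(sync)) over negatives
    return 0.5 * (pos_term + neg_term)

def sync_probability(logit):
    # With a BCE-style loss, the sigmoid of the logit reads as p(sync).
    return 1.0 / (1.0 + np.exp(-logit))

# Example: a confident model gets a lower loss than an undecided one.
undecided = balanced_bce(0.0, [0.0, 0.0, 0.0])
confident = balanced_bce(4.0, [-4.0, -3.0, -5.0])
```

Because each pair is scored independently through a sigmoid, the output for a single audio-video pair can be read as a probability without reference to the other pairs in the batch, which is the interpretability property the abstract attributes to BBCE.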

Paper Structure

This paper contains 20 sections, 18 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: (a) Convolution block used in the encoders. (b) Summary of training our sync-net.
  • Figure 2: Active speaker detection (example from the AVSpeech dataset [cheng2020look]). (a) The probability of synchronization computed using a model trained on the LRS2 dataset. The x-axis is the time step for images, and the y-axis is the offset of the audio with respect to the images. The offset for this video is indicated with a red arrow. (b) The person caught on screen. The probability at offset is $0.29$ and the offscreen ratio is $0.48$.
  • Figure 3: Failure cases. The offset is $-1$ for (a) and $-3$ for (b). In (b), the person does not speak for the duration of the parallelogram regions.
  • Figure 4: Lip-sync quality of various AV speech datasets.
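
The Figure 2 caption describes a grid of sync probabilities indexed by audio offset (y-axis) and image time step (x-axis), from which an offset, a probability at offset, and an offscreen ratio are read. The exact definitions are not given in this section, so the sketch below rests on assumptions: the offset is taken as the row with the highest mean probability, the probability at offset is that row's mean, and the offscreen ratio is the fraction of time steps in that row whose probability falls below a threshold of 0.5. The function name `sync_metrics` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def sync_metrics(prob, offsets, threshold=0.5):
    """Assumed sync-quality metrics from a (num_offsets, num_time_steps) grid.

    prob[i, t] is the predicted probability that the audio shifted by
    offsets[i] is in sync with the images at time step t.
    """
    prob = np.asarray(prob, dtype=float)
    mean_per_offset = prob.mean(axis=1)      # average sync probability per offset
    best = int(mean_per_offset.argmax())     # row with the strongest evidence of sync
    offset = offsets[best]
    prob_at_offset = float(mean_per_offset[best])
    # Time steps that look out of sync even at the best offset.
    offscreen_ratio = float((prob[best] < threshold).mean())
    return offset, prob_at_offset, offscreen_ratio

# Toy grid: 3 candidate offsets, 4 time steps.
grid = [
    [0.1, 0.1, 0.1, 0.1],   # offset -1
    [0.9, 0.9, 0.2, 0.2],   # offset  0: in sync for the first half only
    [0.1, 0.2, 0.1, 0.1],   # offset +1
]
offset, p_at_offset, off_ratio = sync_metrics(grid, offsets=[-1, 0, 1])
```

Under these assumptions, a clip whose visible face is not the speaker for part of its duration yields a low probability at offset and a high offscreen ratio, which matches the qualitative reading of the active-speaker example in Figure 2.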