
Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Nikhil Singh, Chih-Wei Wu, Iroro Orife, Mahdi Kalayeh

TL;DR

The paper tackles the problem that perceptually similar visual scenes can be accompanied by different speech, which can confound audiovisual self-supervised learning. It proposes using multilingual dubbed audio as counterfactual data to augment cross-modal contrastive learning, and applies this to long-form video content with up to seven audio tracks per title. The approach yields state-of-the-art results on LVU and competitive performance on HEAR benchmarks, while showing that linguistic task performance is not severely compromised. The paper also presents a modular pipeline for generating synthetic counterfactual pairs (LVU-M), which enables broader examination of speech variation in audiovisual learning. Overall, the work offers a scalable method for building language-robust audiovisual representations that generalize across domains and tasks.

Abstract

Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show this general approach improves performance on a range of downstream auditory and audiovisual tasks, without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
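
The abstract does not spell out the training objective, so the following is only an illustrative sketch of how dubbed tracks could enter a cross-modal contrastive loss. It assumes an InfoNCE-style video-to-audio loss in which every dubbed track of a clip is a positive for that clip's video embedding and only tracks from other clips act as negatives. The function name `multi_track_info_nce`, the temperature value, and the choice to exclude sibling dubs from the denominator are all assumptions made for illustration, not details taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def multi_track_info_nce(video_emb, audio_emb, temperature=0.1):
    """Hypothetical video-to-audio contrastive loss with several dubs per clip.

    video_emb: list of N video embeddings (one per clip).
    audio_emb: list of N lists, each holding the embeddings of that clip's
               audio tracks (e.g. the original plus its dubs).
    Each track of clip i is treated as a positive for video i. Tracks of
    other clips are negatives. Sibling dubs of the same clip are left out
    of the denominator (an assumption of this sketch).
    """
    videos = [normalize(v) for v in video_emb]
    audios = [[normalize(a) for a in tracks] for tracks in audio_emb]
    losses = []
    for i, v in enumerate(videos):
        # Logits against every audio track that belongs to a different clip.
        neg_logits = [
            dot(v, a) / temperature
            for j, tracks in enumerate(audios)
            if j != i
            for a in tracks
        ]
        for a_pos in audios[i]:
            pos_logit = dot(v, a_pos) / temperature
            logits = [pos_logit] + neg_logits
            # Numerically stable log-sum-exp for the softmax denominator.
            m = max(logits)
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            losses.append(log_denom - pos_logit)
    return sum(losses) / len(losses)


# Toy data: two clips, each with two audio tracks that differ slightly.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[[1.0, 0.1], [0.9, 0.0]], [[0.0, 1.0], [0.1, 1.0]]]
swapped = [aligned[1], aligned[0]]
loss_aligned = multi_track_info_nce(videos, aligned)
loss_swapped = multi_track_info_nce(videos, swapped)
```

In this toy setup the loss is lower when each clip's tracks sit near its own video embedding than when the tracks are swapped between clips, which is the behavior a contrastive objective of this kind is meant to reward.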

Paper Structure (47 sections, 3 equations, 7 figures, 8 tables)


Figures (7)

  • Figure 1: (Left) Audiovisual scenes can be perceptually similar even as the words spoken in them differ, which may be a challenge for self-supervised audiovisual representation learning. (Right) We propose to leverage movie dubs during training and show that it improves the quality of learned representations on a wide range of tasks.
  • Figure 2: Consider the pictured scene. Which of these dialog examples is more likely? Both are plausible within the scene, yet their phonetic-acoustic characteristics would create differences in the soundtrack.
  • Figure 3: Movies and television episodes included in our pretraining dataset are chosen from a diverse set of original languages and genres. Our goal is to minimize content and story biases that could impact our self-supervised models. Note that beyond curating the dataset, we do not use this metadata for representation learning. We normalize per column for visualization.
  • Figure 4: Example clips from our pretraining dataset, showing video stills and mel spectrograms for each of the audio tracks.
  • Figure 5: Pipeline to produce the synthetic counterfactual pairs.
  • ...and 2 more figures