Dissecting Temporal Understanding in Text-to-Audio Retrieval

Andreea-Maria Oncescu, João F. Henriques, A. Sophia Koepke

TL;DR

This work dissects the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets, and introduces a synthetic text-audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models.

Abstract

Recent advancements in machine learning have fueled research on multimodal tasks such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating the temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.

Paper Structure (17 sections, 4 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Analysing and improving the understanding of temporal cues in text-to-audio retrieval models. Top left: AudioCaps$^{uni}$ is a variant of the AudioCaps dataset with modified descriptions that have a more uniform distribution of textual temporal cues. Bottom left: we additionally generate test sets with reversed temporal ordering or replaced temporal conjunctions (TempTest$^{rev}$ and TempTest$^{rep}$). Middle: Generation of audio-text pairs in the SynCaps dataset, which provides a controlled evaluation setting. Right: Our text-text contrastive loss improves temporal understanding by using positive descriptions (green) with the same sound ordering and negative examples (red) with the opposite temporal meaning.
  • Figure 2: Distribution of temporal conjunctions and prepositions in the full AudioCaps (Kim et al., 2019) dataset. Most temporal sentences contain future temporal cues, such as 'Followed by'. There is only a small proportion of past cues, e.g. 'Before'.
  • Figure 3: Distribution of temporal conjunctions and prepositions in the full Clotho (Drossos et al., 2020) dataset. Most temporal sentences contain joint cues (e.g. 'As', 'While'), followed by future ones (e.g. 'Then'). Fewer sentences contain past cues (e.g. 'Before').
  • Figure 4: Distribution of temporal conjunctions and prepositions in the AudioCaps training data. We compare the proportion of temporal textual cues in the original training dataset (Train) and our proposed variant with a more uniform distribution of temporal textual cues (Train$^{uni}$).
  • Figure 5: Distribution of temporal conjunctions and prepositions in the AudioCaps (Kim et al., 2019) validation dataset (Val) compared to our proposed variant with a more uniform distribution of temporal textual cues (Val$^{uni}$).
  • ...and 1 more figures
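
The Figure 1 caption describes two ideas that a small example may help make concrete: building a caption with reversed event order, and a text-text contrastive objective that pulls a caption towards a description with the same ordering and pushes it away from one with the opposite ordering. The following is only an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`reverse_order`, `text_text_contrastive`), the cosine similarity, the temperature value, and the two-way softmax form of the loss are all hypothetical choices, and the toy vectors stand in for real text embeddings.

```python
import math


def reverse_order(caption, cue="followed by"):
    """Swap the two events around a temporal cue (hypothetical helper)."""
    first, sep, second = caption.partition(f" {cue} ")
    if not sep:
        return caption  # cue not found: leave the caption unchanged
    return f"{second} {cue} {first}"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def text_text_contrastive(anchor, positive, negative, temperature=0.1):
    """Two-way softmax loss over one positive and one negative text embedding.

    Assumed form (not taken from the paper):
    -log( exp(s(a, p) / t) / (exp(s(a, p) / t) + exp(s(a, n) / t)) )
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = math.exp(cosine(anchor, negative) / temperature)
    return -math.log(pos / (pos + neg))


# A caption and its order-reversed counterpart, which would act as the negative.
caption = "a dog barks followed by a car horn"
reversed_caption = reverse_order(caption)

# Toy 2-d vectors standing in for text embeddings (assumption for illustration).
anchor = [1.0, 0.0]
same_order = [1.0, 0.0]       # embedding of a paraphrase with the same ordering
opposite_order = [0.0, 1.0]   # embedding of the reversed caption

loss_aligned = text_text_contrastive(anchor, same_order, opposite_order)
loss_confused = text_text_contrastive(anchor, opposite_order, same_order)
```

In this toy setup the loss is small when the anchor is closer to the same-order description and large when it is closer to the reversed one, which is the behaviour such an objective is meant to reward.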