Transformation of audio embeddings into interpretable, concept-based representations

Alice Zhang, Edison Thomaz, Lie Lu

TL;DR

The paper tackles the interpretability gap of state-of-the-art audio embeddings by proposing a post-hoc method that maps dense CLAP embeddings to sparse, human-interpretable concept-based representations using an overcomplete concept vocabulary. It constructs three 2,000-item vocabularies from FSD50K and applies a sparse, nonnegative embedding decomposition to approximate the original audio embeddings with a concise set of semantic concepts; a fine-tuned variant uses a linear projector to tailor representations to downstream tasks. Across seven datasets and multiple tasks, the concept-based representations match or exceed the performance of dense CLAP embeddings while offering semantic explanations, and sparsity analysis reveals favorable tradeoffs. The work also demonstrates that fine-tuning and vocabulary construction choices can influence zero-shot classification and retrieval results, and it publicly releases three audio-specific vocabularies to facilitate future research in interpretable audio representations and potential concept-based editing or generation.
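A sparse, non-negative decomposition of this kind is commonly written as a non-negative lasso problem. The block below is a sketch of that standard form, not a transcription of the paper's own equations: $z$ is the dense CLAP embedding, $w$ the concept weights, $C$ the concept vocabulary (assumed here to be stacked row-wise, one text embedding per concept), and $\lambda$ the L1 penalty. The factor $\tfrac{1}{2}$ and the exact normalization are assumptions.

```latex
$$
\hat{w} \;=\; \arg\min_{w \,\ge\, 0} \;\; \tfrac{1}{2}\,\lVert C^{\top} w - z \rVert_2^2 \;+\; \lambda\,\lVert w \rVert_1,
\qquad
\hat{z} \;=\; C^{\top}\hat{w}
$$
```

Larger values of $\lambda$ push more entries of $\hat{w}$ to exactly zero, so the number of active concepts (the L0 norm of $\hat{w}$) shrinks while the reconstruction $\hat{z}$ drifts further from $z$.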

Abstract

Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio representations. In this work, we explore the semantic interpretability of audio embeddings extracted from these neural networks by leveraging CLAP, a contrastive learning model that brings audio and text into a shared embedding space. We implement a post-hoc method to transform CLAP embeddings into concept-based, sparse representations with semantic interpretability. Qualitative and quantitative evaluations show that the concept-based representations outperform or match the performance of original audio embeddings on downstream tasks while providing interpretability. Additionally, we demonstrate that fine-tuning the concept-based representations can further improve their performance on downstream tasks. Lastly, we publish three audio-specific vocabularies for concept-based interpretability of audio embeddings.

Paper Structure

This paper contains 15 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: A diagram of our concept decomposition system illustrates how dense CLAP embeddings (z) are transformed into concept-based representation (w) by solving for a sparse, non-negative linear decomposition over a concept vocabulary (C).
  • Figure 2: Example audio clips from Clotho with their captions and the corresponding concept representations (concept, prominence value). We show the top-3 concepts, but each audio embedding decomposition contains 35-45 concepts in total.
  • Figure 3: Distribution of top-5 concepts across two audio classes.
  • Figure 4: Zero-shot classification on multiple datasets as the L1 penalty varies from 0.01 to 0.50 (yielding solutions with L0 norms of roughly 5 to 200) and as the vocabulary size varies from 2,000 to 5,000 concepts.
  • Figure 5: Cosine similarity between the concept-based representation and original CLAP embedding.
  • ...and 3 more figures
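
The decomposition in Figure 1 and the quantities plotted in Figures 4 and 5 (L1 penalty, L0 norm, cosine similarity to the original embedding) can be illustrated with a toy sketch. This is not the authors' implementation: the solver is a plain projected proximal-gradient loop chosen for brevity, and the 3-dimensional "embeddings", the concept names, and the function names `decompose`, `reconstruct`, and `cosine` are all hypothetical stand-ins for real CLAP vectors and a 2,000-item vocabulary.

```python
import math


def decompose(z, concepts, l1_penalty=0.1, n_iter=2000):
    """Approximately solve min_{w >= 0} 0.5*||C^T w - z||^2 + l1_penalty*sum(w)
    with projected proximal-gradient steps (assumed solver, for illustration)."""
    k, d = len(concepts), len(z)
    # The squared Frobenius norm of C upper-bounds the gradient's Lipschitz
    # constant, so 1/that is a safe (conservative) step size.
    step = 1.0 / sum(v * v for c in concepts for v in c)
    w = [0.0] * k
    for _ in range(n_iter):
        recon = reconstruct(w, concepts)
        resid = [recon[j] - z[j] for j in range(d)]
        for i in range(k):
            grad = sum(concepts[i][j] * resid[j] for j in range(d))
            # Gradient step, L1 shrinkage, and projection onto w >= 0.
            w[i] = max(0.0, w[i] - step * (grad + l1_penalty))
    return w


def reconstruct(w, concepts):
    """Return C^T w, the embedding implied by the concept weights."""
    d = len(concepts[0])
    return [sum(w[i] * concepts[i][j] for i in range(len(w))) for j in range(d)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical toy vocabulary: unit-norm "text embeddings" for four concepts.
s = 1.0 / math.sqrt(2.0)
names = ["dog bark", "rain", "engine hum", "dog barking in rain"]
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]

# Hypothetical unit-norm "audio embedding", mostly aligned with "dog bark".
z = [0.9 / 0.95, 0.3 / 0.95, 0.05 / 0.95]

for penalty in (0.01, 0.1, 0.5):
    w = decompose(z, concepts, l1_penalty=penalty)
    active = sorted(((wi, n) for wi, n in zip(w, names) if wi > 0), reverse=True)
    sim = cosine(reconstruct(w, concepts), z)
    print(f"penalty={penalty}: L0={len(active)}, cosine={sim:.3f}, concepts={active}")
```

Sweeping `l1_penalty` in this way mirrors the sparsity analysis in spirit: a larger penalty keeps fewer active concepts at the cost of a looser reconstruction. For real vocabularies, an off-the-shelf non-negative lasso solver would be a more practical choice than this hand-written loop.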