CATSE: A Context-Aware Framework for Causal Target Sound Extraction

Shrishail Baligar, Mikolaj Kegler, Bryce Irvin, Marko Stamenovic, Shawn Newsam

TL;DR

This work introduces a family of context-aware low-latency causal TSE models suitable for real-time processing and shows that the proposed model outperforms size- and latency-matched Waveformer, a state-of-the-art model for real-time TSE.

Abstract

Target Sound Extraction (TSE) focuses on the problem of separating sources of interest, indicated by a user's cue, from the input mixture. Most existing solutions operate in an offline fashion and are not suited to the low-latency causal processing constraints imposed by applications in live-streamed content such as augmented hearing. We introduce a family of context-aware low-latency causal TSE models suitable for real-time processing. First, we explore the utility of context by providing the TSE model with oracle information about what sound classes make up the input mixture, where the objective of the model is to extract one or more sources of interest indicated by the user. Since the practical applications of oracle models are limited due to their assumptions, we introduce a composite multi-task training objective involving separation and classification losses. Our evaluation involving single- and multi-source extraction shows the benefit of using context information in the model either by means of providing full context or via the proposed multi-task training loss without the need for full context information. Specifically, we show that our proposed model outperforms size- and latency-matched Waveformer, a state-of-the-art model for real-time TSE.
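
To make the composite multi-task objective concrete, here is a minimal sketch of how a separation loss and a classification loss might be combined into one training loss. The abstract does not specify the individual loss terms or their weighting, so everything here is an assumption for illustration: the L1 waveform loss, the binary cross-entropy over sound classes, the `weight` parameter, and all function names are hypothetical.

```python
import math


def l1_separation_loss(estimate, target):
    """Mean absolute error between estimated and target waveforms.

    Assumption: a stand-in for whatever separation loss the paper uses.
    """
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def bce_classification_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over sound classes (multi-label).

    Assumption: the classification heads predict, per class, whether that
    class is present in the mixture.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def composite_loss(estimate, target, class_probs, class_labels, weight=0.5):
    """Separation loss plus a weighted classification loss (hypothetical weighting)."""
    sep = l1_separation_loss(estimate, target)
    cls = bce_classification_loss(class_probs, class_labels)
    return sep + weight * cls


# Toy usage: a silent estimate against a constant target, one uncertain class.
loss = composite_loss([0.0, 0.0], [1.0, 1.0], [0.5], [1])
print(round(loss, 4))
```

In this sketch the classification term only shapes training; at inference only the separated waveform would be used, in line with the description of classification heads that are not used for inference.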

Paper Structure (11 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Overview of the proposed causal Context-Aware TSE models. (a) pcTCN applies conditioning to the separator. (b) eCATSE integrates oracle context in addition to the hint. (c) iCATSE achieves context awareness during multi-task training through classification heads, which are not used for inference.
  • Figure 2: Our proposed TCN-based separator performs conditioning at every Conv1D layer of the three TCNs, for a total of 18 locations. This pervasive conditioning is performed in all three proposed models: pcTCN, eCATSE, and iCATSE.
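
The captions above can be illustrated with a toy sketch of pervasive conditioning: a conditioning vector is built from the user's hint (optionally concatenated with oracle context, as in eCATSE) and applied at every layer of three stacked TCNs. The captions do not state the conditioning operation or the per-TCN layer count, so the following are assumptions: six layers per TCN (18 sites split evenly), a simple per-channel multiplicative gate in place of a real Conv1D layer, and all names (`multi_hot`, `build_conditioning`, `conditioned_layer`, `separator`).

```python
NUM_TCNS = 3
LAYERS_PER_TCN = 6  # assumption: 18 conditioning sites split evenly over 3 TCNs


def multi_hot(active, num_classes):
    """Encode a set of class indices as a multi-hot vector."""
    return [1.0 if i in active else 0.0 for i in range(num_classes)]


def build_conditioning(hint, context=None):
    """Hint only (pcTCN / iCATSE style) or hint plus oracle context (eCATSE style)."""
    return list(hint) if context is None else list(hint) + list(context)


def conditioned_layer(features, cond, weights):
    """Toy stand-in for one conditioned Conv1D layer.

    Each channel is scaled by a gate computed as a linear projection of the
    conditioning vector (hypothetical; not the paper's stated operation).
    """
    gates = [sum(w * c for w, c in zip(row, cond)) for row in weights]
    return [f * g for f, g in zip(features, gates)]


def separator(features, cond, weights):
    """Apply conditioning at every layer of every TCN and count the sites."""
    sites = 0
    for _ in range(NUM_TCNS):
        for _ in range(LAYERS_PER_TCN):
            features = conditioned_layer(features, cond, weights)
            sites += 1
    return features, sites


# Toy usage: 3 sound classes, user asks for class 0, mixture contains classes 0 and 1.
hint = multi_hot({0}, 3)
context = multi_hot({0, 1}, 3)
cond = build_conditioning(hint, context)       # 6-dimensional conditioning vector
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]] * 2  # both channel gates read the hint for class 0
out, sites = separator([0.5, -0.25], cond, weights)
print(sites)
```

Passing `context=None` to `build_conditioning` gives the hint-only variant; the loop structure is unchanged, which mirrors the statement that the same pervasive conditioning is used in all three models.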