Multi-View Multi-Task Modeling with Speech Foundation Models for Speech Forensic Tasks

Orchid Chetia Phukan, Devyani Koshal, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma

TL;DR

This study evaluates state-of-the-art (SOTA) speech foundation models (SFMs) by extracting their representations for speech forensic tasks (SFTs) and measuring their effectiveness on each task. It also introduces TANGO (Task Alignment with iNter-view Gated Optimal transport), a framework that combines representations from different SFMs for multi-task learning.

Abstract

Speech forensic tasks (SFTs), such as automatic speaker recognition (ASR), speech emotion recognition (SER), gender recognition (GR), and age estimation (AE), are used in a range of security and biometric applications. Previous works have applied various techniques, with recent studies applying speech foundation models (SFMs) for improved performance. However, most prior efforts have built a separate model for each task, despite the inherent similarities among these tasks. This isolated approach raises computational resource requirements, cost, training time, and maintenance effort. In this study, we address these challenges with a multi-task learning strategy. First, we explore various state-of-the-art (SOTA) SFMs by extracting their representations for learning these SFTs and investigate their effectiveness on each task individually. Second, we analyze the performance of the extracted representations on the SFTs in a multi-task learning framework. We observe a decline in performance when SFTs are modeled together compared to individual task-specific models, and as a remedy, we propose multi-view learning (MVL). Views are representations from different SFMs, each transformed into a distinct abstract space by characteristics unique to that SFM. By leveraging MVL, we integrate these diverse representations to capture complementary information across tasks, enhancing the shared learning process. We introduce a new framework called TANGO (Task Alignment with iNter-view Gated Optimal transport) to implement this approach. With TANGO, we achieve the best performance in comparison to individual SFM representations as well as baseline fusion techniques across benchmark datasets such as CREMA-D, emo-DB, and BAVED.

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Single view models: (a) single view single task and (b) single view multi-task.
  • Figure 2: Multi-view models: (a) multi-view multi-task with concatenation fusion and (b) TANGO. Here, X11 and X22 denote features from two different views, while X12 and X21 represent features transported from view 2 to the view 1 network and from view 1 to the view 2 network, respectively.
  • Figure 3: Age and Gender Distribution for CREMA-D.
  • Figure 4: Age and Gender Distribution for BAVED.
  • Figure 5: Age and Gender Distribution for emo-DB.
  • ...and 2 more figures
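
The data flow named in the Figure 2(b) caption (own-view features X11 and X22, cross-view transported features X12 and X21) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not specify the optimal transport solver, the cost function, or the form of the gate. The sketch assumes entropy-regularized optimal transport solved with Sinkhorn iterations, a squared Euclidean cost between utterances in a batch, views already projected to a common dimension, and a simple sigmoid gate. All function names (`pairwise_sq_dist`, `sinkhorn_plan`, `transport`, `gated_fusion`) and the random weights are hypothetical.

```python
import numpy as np


def pairwise_sq_dist(a, b):
    """Squared Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)


def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on the target points
    b = np.full(m, 1.0 / m)  # uniform mass on the source points
    # Rescale the cost to [0, 1] so the exponential stays well conditioned.
    kernel = np.exp(-cost / (cost.max() + 1e-12) / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def transport(target_view, source_view):
    """Barycentric projection of source_view rows onto the target view's rows."""
    plan = sinkhorn_plan(pairwise_sq_dist(target_view, source_view))
    return (plan @ source_view) / plan.sum(axis=1, keepdims=True)


def gated_fusion(own, transported, weight, bias):
    """Element-wise sigmoid gate mixing own and transported features (assumed form)."""
    gate_input = np.concatenate([own, transported], axis=1) @ weight + bias
    gate = 1.0 / (1.0 + np.exp(-gate_input))
    return gate * own + (1.0 - gate) * transported


# Toy batch: 8 utterances, two views already projected to 16 dimensions.
rng = np.random.default_rng(0)
n, d = 8, 16
x11 = rng.normal(size=(n, d))  # view 1 features (stand-in for one SFM)
x22 = rng.normal(size=(n, d))  # view 2 features (stand-in for another SFM)

x12 = transport(x11, x22)  # view 2 features transported to the view 1 network
x21 = transport(x22, x11)  # view 1 features transported to the view 2 network

# Random gate weights stand in for parameters that would be learned.
w1, w2 = rng.normal(scale=0.1, size=(2, 2 * d, d))
h1 = gated_fusion(x11, x12, w1, np.zeros(d))
h2 = gated_fusion(x22, x21, w2, np.zeros(d))

# Shared representation that per-task heads (e.g. SER, GR, AE) would consume.
shared = np.concatenate([h1, h2], axis=1)
print(shared.shape)  # (8, 32)
```

In a trained model the gate weights would be learned jointly with the task heads; here they are random only so that the sketch runs end to end. Dropping the transport and gating steps and concatenating x11 with x22 directly gives the concatenation-fusion baseline of Figure 2(a).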