Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations

Jaeyeon Kim, Injune Hwang, Kyogu Lee

TL;DR

This work tackles semantic learning from raw audio by combining contextual and phonetic speech representations in a dual-channel transformer framework. It introduces a speech-to-unit processing pipeline that aligns discrete contextual and phonetic units, along with two masking objectives, masked context reconstruction (MCR) and masked context prediction (MCP), to drive semantic understanding. On the sSIMI and Fluent Speech Command evaluations, the approach shows improved semantic alignment and SLU performance, with ablations supporting the value of dual-channel interactions and of the masking tasks across multiple representations. The results suggest that leveraging cross-resolution representations from raw audio can enhance semantic abstraction without text supervision, offering a practical path toward more robust spoken language models.

Abstract

We propose a framework to learn semantics from raw audio signals using two types of representations, encoding contextual and phonetic information respectively. Specifically, we introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions. For the language model, we adopt a dual-channel architecture to incorporate both types of representation. We also present new training objectives, masked context reconstruction and masked context prediction, that push models to learn semantics effectively. Experiments on the sSIMI metric of the Zero Resource Speech Benchmark 2021 and on the Fluent Speech Command dataset show that our framework learns semantics better than models trained with only one type of representation.

Paper Structure

This paper contains 11 sections, 1 equation, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Example of speech-to-unit processing and the overall architecture of our framework. Integer values denote the discrete contextual and phonetic units.