Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech

Adrien Pupier, Maximin Coavoux, Jérôme Goulian, Benjamin Lecouteux

TL;DR

This work investigates end-to-end dependency parsing directly from speech, aiming to leverage prosody and bypass transcription errors inherent in ASR pipelines. It introduces a graph-based parsing architecture that operates on audio-derived word embeddings and compares it to a sequence-labeling parser and to pipeline baselines on the Orféo French spoken corpus. Across multiple settings, the graph-based model generally outperforms alternatives, and end-to-end speech parsing can surpass pipeline approaches, especially as ASR quality improves. The study highlights word-level speech representations as the key bottleneck and suggests future work to enhance segmentation and expand to more languages beyond French.

Abstract

Direct dependency parsing of the speech signal -- as opposed to parsing speech transcriptions -- has recently been proposed as a task (Pupier et al. 2022), as a way of incorporating prosodic information into the parsing system and bypassing the limitations of a pipeline approach that first applies an Automatic Speech Recognition (ASR) system and then a syntactic parser. In this article, we report on a set of experiments assessing the performance of two parsing paradigms (graph-based parsing and sequence-labeling-based parsing) on speech parsing. We perform this evaluation on a large treebank of spoken French, featuring realistic spontaneous conversations. Our findings show that (i) the graph-based approach obtains better results across the board, and (ii) parsing directly from speech outperforms a pipeline approach, despite having 30% fewer parameters.

Paper Structure (15 sections, 1 figure, 5 tables)