Towards Event Extraction from Speech with Contextual Clues

Jingqi Kang, Tongtong Wu, Jinming Zhao, Guitao Wang, Guilin Qi, Yuan-Fang Li, Gholamreza Haffari

TL;DR

This paper defines Speech Event Extraction (SpeechEE), an end-to-end task that derives semantic events directly from raw speech. It introduces RISEN, a sequence-to-structure generator that conditions event output on ASR transcripts and adopts a flat event serialization to better align with speech, achieving strong gains over baselines. To enable research, the authors construct three synthetic English/Chinese datasets (Speech-ACE05*, Speech-MAVEN*, Speech-DuEE) and a real-speech test set (Human-MAVEN), and evaluate multiple baselines on them. The results show that contextual clues from transcripts substantially improve end-to-end SpeechEE, with F1 gains of up to 10.7 percentage points, and highlight the advantages of the flat-format representation for speech-driven event extraction.

Abstract

While text-based event extraction has been an active research area and has seen successful application in many domains, extracting semantic events from speech directly is an under-explored problem. In this paper, we introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set. Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries. Additionally, unlike perceptible sound events, semantic events are more subtle and require a deeper understanding. To tackle these challenges, we introduce a sequence-to-structure generation paradigm that can produce events from speech signals in an end-to-end manner, together with a conditioned generation method that utilizes speech recognition transcripts as the contextual clue. We further propose to represent events with a flat format to make outputs more natural language-like. Our experimental results show that our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%. The code and datasets are released on https://github.com/jodie-kang/SpeechEE.

Paper Structure (27 sections, 5 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Examples of text-based event extraction (TextEE) and speech-based event extraction (SpeechEE). Left: TextEE models take raw text as input and generate a Transport event. Right: SpeechEE models take raw speech as input and generate a Transport event.
  • Figure 2: Illustrations of two event structure linearization strategies. Left: Tree format. The red solid line indicates the event-role relation; the blue dotted line indicates the label-span relation, where the head is a label and the tail is a text span. For example, "Transport-returned" is a label-span relation edge, in which the head is "Transport" and the tail is "returned". Right: Flat format. Red marks the event type and argument roles; blue marks the trigger and argument mentions.
  • Figure 3: Overview of RISEN. We use a flexible encoder-decoder Transformer (Vaswani et al., 2017) architecture. We freeze the audio encoder and fine-tune the text decoder. The transcript generated by the decoder serves as a contextual clue, guiding the generation of the event structure.
  • Figure 4: Tree format vs. Flat format. We implement Text2Event for TextEE and Whisper-medium for SpeechEE. The Tree format suits TextEE, while the Flat format excels in SpeechEE.
  • Figure 5: Comparison of Tree format and Flat format.
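
The two linearization strategies described in the Figure 2 caption can be illustrated with a minimal sketch. Only the "Transport" label and the "returned" trigger come from the caption. The argument roles and mentions, the `Event` class, the helper names, and the exact brackets and delimiters are assumptions made for illustration; they are not the authors' serialization.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    """Hypothetical container for one event (not taken from the paper's code)."""
    event_type: str   # label, e.g. "Transport"
    trigger: str      # text span that evokes the event, e.g. "returned"
    # (role, mention) pairs: each role is a label, each mention a text span
    arguments: List[Tuple[str, str]] = field(default_factory=list)


def to_tree(event: Event) -> str:
    """Nested, bracketed form: arguments hang off the event node."""
    head = f"{event.event_type} {event.trigger}"
    args = " ".join(f"({role} {mention})" for role, mention in event.arguments)
    return f"({head} {args})" if args else f"({head})"


def to_flat(event: Event) -> str:
    """Flat form: a single sequence of label-span pairs with no nesting.

    The ': ' and '; ' delimiters are assumed for this sketch.
    """
    pairs = [(event.event_type, event.trigger), *event.arguments]
    return "; ".join(f"{label}: {span}" for label, span in pairs)


# "Transport"/"returned" follow the Figure 2 caption; the arguments are made up.
event = Event(
    "Transport",
    "returned",
    [("Artifact", "man"), ("Destination", "Los Angeles")],
)
print(to_tree(event))
print(to_flat(event))
```

In this sketch the tree form requires the decoder to emit balanced brackets, whereas the flat form reads more like ordinary text, which is the property the captions associate with better SpeechEE results.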