CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework

Jinzhong Ning, Paerhati Tulajiang, Yingying Le, Yijia Zhang, Yuanyuan Sun, Hongfei Lin, Haifeng Liu

TL;DR

This work addresses the scarcity and limited diversity of real-human speech data in SpeechRE by introducing CommonVoice-SpeechRE, a large-scale real-human speech dataset, and RPG-MoGe, a multi-order generative framework that uses relation prompts and a CNN-based latent relation predictor to improve cross-modal alignment and triplet generation. The framework combines a Whisper-based Speech Encoder, a Latent Relation Prediction Head, and a Text Decoder guided by multiple order views of relation trees, enabling end-to-end generation of $(h,r,t)$ triplets from speech. Experiments across multiple datasets show state-of-the-art performance: RPG-MoGe outperforms the baselines, and smaller Whisper backbones match larger models, which supports both the dataset and the modeling approach. The work moves SpeechRE toward real-world applicability by combining diverse data with a multi-view generation strategy that exploits high-level semantic cues.

Abstract

Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data and lack sufficient quantity and diversity of real human speech. Moreover, existing models suffer from rigid single-order generation templates and weak semantic alignment, which substantially limits their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.
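
To make the idea of an ensemble over element orders concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the decoder has already been run once per element order and that each output has been parsed into (head, relation, tail) tuples. The abstract does not state the combination rule, so the voting scheme, the function name `ensemble_triplets`, the `min_votes` threshold, and the example predictions are all hypothetical.

```python
from collections import Counter


def ensemble_triplets(predictions_per_order, min_votes=2):
    """Keep triplets predicted under at least `min_votes` element orders.

    Assumption: simple voting; the paper's actual ensemble rule may differ.
    """
    votes = Counter()
    for triplets in predictions_per_order.values():
        # Using a set lets each order vote at most once per triplet.
        votes.update(set(triplets))
    return sorted(t for t, n in votes.items() if n >= min_votes)


# Hypothetical parsed outputs for one utterance under three element orders.
predictions = {
    "hrt": [("Paris", "capital of", "France")],
    "rht": [("Paris", "capital of", "France"), ("Paris", "located in", "Europe")],
    "trh": [("Paris", "capital of", "France")],
}

print(ensemble_triplets(predictions))
```

In this sketch, a triplet that appears under only one order is treated as less reliable and dropped; lowering `min_votes` to 1 turns the ensemble into a plain union.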

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Explanation of the multi-view relation tree and its linearization process. Here, "$<h>$", "$<r>$", and "$<t>$" are special tokens representing the head entity, relation type, and tail entity of the relational triple respectively.
  • Figure 2: The overall architecture of RPG-MoGe.
  • Figure 3: Implementation details for the Inference Phase in RPG-MoGe.
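
The linearization mentioned in the Figure 1 caption can be illustrated with a small sketch. It uses the special tokens `<h>`, `<r>`, and `<t>` named in the caption, but the exact target template is an assumption, and this flat version does not model how a relation tree shares common nodes between triplets. The names `SPECIAL`, `linearize`, and `ALL_ORDERS` are hypothetical.

```python
from itertools import permutations

# Special tokens from the Figure 1 caption.
SPECIAL = {"h": "<h>", "r": "<r>", "t": "<t>"}

# The six possible element orders of a triplet: "hrt", "htr", "rht", ...
ALL_ORDERS = ["".join(p) for p in permutations("hrt")]


def linearize(triplets, order):
    """Serialize (head, relation, tail) triplets following `order`, e.g. "rht".

    Assumption: each element is emitted as "<token> text"; the paper's
    actual template may differ.
    """
    parts = []
    for head, relation, tail in triplets:
        slots = {"h": head, "r": relation, "t": tail}
        parts.extend(f"{SPECIAL[key]} {slots[key]}" for key in order)
    return " ".join(parts)


example = [("Paris", "capital of", "France")]
for order in ALL_ORDERS:
    print(order, "->", linearize(example, order))
```

Each order yields a different target sequence for the same triplets, which is one way the same annotated utterance can supply several training views.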