wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

Khai Le-Duc, Quy-Anh Dang, Tan-Hanh Pham, Truong-Son Hy

TL;DR

wav2graph is a framework for building knowledge graphs (KGs) from speech and training supervised models on them. It builds a KG with two node types, utterances and named entities, converts the nodes into embeddings, and trains GNNs for node classification and link prediction in inductive and transductive settings. Across human and ASR transcripts, the study finds that embedding quality, especially from multilingual acoustic pre-training and multilingual LLM embeddings, substantially boosts performance, while the relative strength of each GNN architecture depends on the task and the embedding. The work provides baselines, error analyses, and publicly released code, data, and models, and points to the potential of speech-derived KGs to improve reasoning in AI systems and search engines.
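The difference between the transductive and inductive settings can be illustrated with a small edge-splitting sketch. This follows a common reading of the two terms, not the paper's exact protocol, which this summary does not describe. The toy edge list, the function names, and the `test_ratio` and `seed` parameters are all hypothetical.

```python
import random

# Hypothetical toy edges: (utterance node, named-entity node).
edges = [
    ("u101", "aspirin"), ("u101", "chest pain"),
    ("u102", "aspirin"), ("u102", "stomach"),
    ("u103", "chest pain"), ("u103", "heart attack"),
]


def transductive_split(edges, test_ratio=0.2, seed=0):
    """Hold out edges only; all nodes remain known at training time."""
    rng = random.Random(seed)
    shuffled = sorted(edges)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def inductive_split(edges, test_ratio=0.2, seed=0):
    """Hold out nodes; any edge touching a held-out node is kept out of training."""
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    rng.shuffle(nodes)
    n_test = max(1, int(len(nodes) * test_ratio))
    held_out = set(nodes[:n_test])
    train = [e for e in edges if held_out.isdisjoint(e)]
    test = [e for e in edges if not held_out.isdisjoint(e)]
    return train, test, held_out


trans_train, trans_test = transductive_split(edges)
ind_train, ind_test, held_out = inductive_split(edges)
```

In the transductive split the model is evaluated on unseen edges between nodes it already knows; in the inductive split it is evaluated on nodes it never saw during training.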

Abstract

Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs focus only on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning of knowledge graphs from speech data. Our pipeline is straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting the KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.
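Step (1) of the pipeline can be pictured with a minimal sketch that links utterance nodes to the named entities they mention. This is an illustration under stated assumptions, not the authors' implementation: the transcripts, the entity list, the substring-matching rule, and the function names `build_kg`, `neighbors`, and `related_entities` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: utterance IDs mapped to transcripts, plus a named-entity list.
utterances = {
    101: "the patient was given aspirin for chest pain",
    102: "aspirin can irritate the stomach",
    103: "chest pain may signal a heart attack",
}
named_entities = ["aspirin", "chest pain", "stomach", "heart attack"]


def build_kg(utterances, named_entities):
    """Link each utterance node to every named entity found in its transcript.

    Assumption: plain substring matching stands in for a real entity lookup.
    """
    edges = set()
    for utt_id, text in utterances.items():
        for entity in named_entities:
            if entity in text.lower():
                edges.add((utt_id, entity))
    return edges


def neighbors(edges):
    """Build an undirected adjacency list over utterance and entity nodes."""
    adj = defaultdict(set)
    for utt_id, entity in edges:
        adj[utt_id].add(entity)
        adj[entity].add(utt_id)
    return adj


def related_entities(entity, adj):
    """Entities reachable through a shared utterance (entity-utterance-entity)."""
    return {e for utt in adj[entity] for e in adj[utt] if e != entity}


edges = build_kg(utterances, named_entities)
adj = neighbors(edges)
```

With this structure, two entities never share a direct edge; they are connected only through the utterances in which both occur. Steps (2) and (3) would then attach an embedding to each node and train a GNN on the resulting graph.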

Paper Structure (44 sections, 10 equations, 26 figures, 9 tables)


Figures (26)

  • Figure 1: Visualization of our wav2graph framework. We train GNNs on the KG built from human transcripts and their corresponding NEs. We then run inference directly on another KG, built from ASR transcripts, to obtain node attributes and node relationships.
  • Figure 2: An example of our KG. Nodes 101, 102, and 103 are utterance identification numbers, while the remaining nodes are NEs. We follow the entity-utterance-entity approach (al2020named) to represent relationships between nodes in the KG.
  • Figure 3: Loss at each iteration with the SAGE model.
  • Figure 4: Loss at each iteration with the GCN model.
  • Figure 5: Loss at each iteration with the GAT model.
  • ...and 21 more figures