wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech
Khai Le-Duc, Quy-Anh Dang, Tan-Hanh Pham, Truong-Son Hy
TL;DR
wav2graph is a framework for constructing knowledge graphs (KGs) from speech and training supervised models on them. It builds a KG with two node types, utterances and named entities, converts the nodes into embeddings, and trains GNNs for node classification and link prediction under inductive and transductive settings. Across human and ASR transcripts, the study finds that embedding quality, especially multilingual acoustic pre-training and multilingual LLM embeddings, substantially improves performance, while the strengths of individual GNN architectures depend on the task and the embedding. The work provides baselines, error analyses, and publicly released code, data, and models, and points to the potential of speech-derived KGs for improving reasoning in AI systems and search engines.
Abstract
Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, existing KGs focus only on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning of knowledge graphs from speech data. Our pipeline is straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting the KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. We run extensive experiments in inductive and transductive learning contexts using state-of-the-art GNN models, and provide baseline results and error analysis for node classification and link prediction on both human transcripts and automatic speech recognition (ASR) transcripts. The evaluations cover encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.
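The three pipeline steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the toy transcripts and entity list are made up, entity mentions are found by simple substring matching, bag-of-words counts stand in for the pre-trained node embeddings, and a single unweighted mean-aggregation step stands in for a trained GNN layer. The function names `build_kg`, `bag_of_words`, and `mean_aggregate` are hypothetical.

```python
from collections import defaultdict


def build_kg(utterances, entities):
    """Step 1: link each utterance node to the entity nodes it mentions.

    Assumption: a mention is a case-insensitive substring match.
    """
    nodes = {}
    edges = set()
    for i, text in enumerate(utterances):
        nodes[f"utt:{i}"] = {"type": "utterance", "text": text}
    for name in entities:
        nodes[f"ent:{name.lower()}"] = {"type": "entity", "text": name}
    for i, text in enumerate(utterances):
        lowered = text.lower()
        for name in entities:
            if name.lower() in lowered:
                edges.add((f"utt:{i}", f"ent:{name.lower()}"))
    return nodes, edges


def bag_of_words(nodes):
    """Step 2 (stand-in): turn each node's text into a word-count vector."""
    vocab = sorted({w for n in nodes.values() for w in n["text"].lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    feats = {}
    for node_id, n in nodes.items():
        vec = [0.0] * len(vocab)
        for w in n["text"].lower().split():
            vec[index[w]] += 1.0
        feats[node_id] = vec
    return feats, vocab


def mean_aggregate(feats, edges):
    """Step 3 (stand-in): average each node's vector with its neighbours'.

    A trained GNN layer would add learned weights and a non-linearity.
    """
    neighbours = defaultdict(list)
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    out = {}
    for node_id, vec in feats.items():
        group = [vec] + [feats[m] for m in neighbours[node_id]]
        out[node_id] = [sum(col) / len(group) for col in zip(*group)]
    return out


# Toy data (made up for illustration).
utterances = [
    "the patient takes aspirin daily",
    "aspirin can cause nausea",
    "no fever today",
]
entities = ["aspirin", "nausea", "fever"]

nodes, edges = build_kg(utterances, entities)
feats, vocab = bag_of_words(nodes)
hidden = mean_aggregate(feats, edges)
print(len(nodes), "nodes,", len(edges), "edges")
```

For link prediction, a subset of the utterance-entity edges would be held out and scored from the aggregated node vectors; for node classification, the node type or entity label would be predicted from the same vectors.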
