Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs

Reham Omar, Omij Mangukiya, Essam Mansour

TL;DR

The paper addresses the challenge of domain-specific dialogue benchmark generation from knowledge graphs and introduces Chatty-Gen, a multi-stage retrieval-augmented generation platform with assertion-based validation to mitigate hallucinations and reduce reliance on costly LLMs. It defines a KG-based dialogue benchmark $D = \{e, KG, Q, SQ\}$ and leverages subgraphs $SG(e)$ to extract rich dialogue context, using a four-stage generation pipeline (independent questions, SPARQL generation, dialogue assembly, and optional summarization) with validated outputs. The approach combines diverse node-type selection, automatic entity-label extraction, and subgraph serialization to ensure scalability across arbitrary KGs (DBpedia, YAGO, DBLP, MAG) and across open-source and commercial LLMs, achieving significant time savings (e.g., orders-of-magnitude reductions for large KGs) while maintaining quality. Empirical results show that Chatty-Gen outperforms the state of the art in question quality and time efficiency and maintains consistent performance across multiple LLMs, underscoring its practical impact for scalable KG-based dialogue benchmarks.
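
To make the notation concrete, the following is a minimal sketch of how one benchmark entry $D = \{e, KG, Q, SQ\}$ might be held in memory, together with a query that retrieves a one-hop subgraph $SG(e)$. Reading $e$ as the seed entity, $Q$ as the dialogue questions, and $SQ$ as the corresponding SPARQL queries is an assumption inferred from this summary. The names `DialogueEntry` and `subgraph_query`, the field names, and the one-hop query shape are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueEntry:
    """Hypothetical container for one benchmark entry D = {e, KG, Q, SQ}."""

    seed_entity: str  # e: IRI of the seed entity (assumed meaning)
    kg_endpoint: str  # KG: SPARQL endpoint the entry was built from (assumed)
    questions: List[str] = field(default_factory=list)       # Q
    sparql_queries: List[str] = field(default_factory=list)  # SQ

    def is_aligned(self) -> bool:
        # Every question should have exactly one SPARQL counterpart.
        return len(self.questions) == len(self.sparql_queries)


def subgraph_query(entity_iri: str, limit: int = 100) -> str:
    """Build a SPARQL query for triples in which the entity is subject or object."""
    e = f"<{entity_iri}>"
    return (
        "SELECT ?s ?p ?o WHERE {\n"
        f"  {{ {e} ?p ?o . BIND({e} AS ?s) }}\n"
        "  UNION\n"
        f"  {{ ?s ?p {e} . BIND({e} AS ?o) }}\n"
        f"}} LIMIT {limit}"
    )
```

Because the query is issued per seed entity, only the triples around that entity are fetched, which matches the query-based retrieval idea of avoiding a pass over the whole KG.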

Abstract

Dialogue benchmarks are crucial in training and evaluating chatbots engaging in domain-specific conversations. Knowledge graphs (KGs) represent semantically rich and well-organized data spanning various domains, such as DBLP, DBpedia, and YAGO. Traditionally, dialogue benchmarks have been manually created from documents, neglecting the potential of KGs in automating this process. Some question-answering benchmarks are automatically generated using extensive preprocessing from KGs, but they do not support dialogue generation. This paper introduces Chatty-Gen, a novel multi-stage retrieval-augmented generation platform for automatically generating high-quality dialogue benchmarks tailored to a specific domain using a KG. Chatty-Gen decomposes the generation process into manageable stages and uses assertion rules for automatic validation between stages. Our approach enables control over intermediate results to prevent time-consuming restarts due to hallucinations. It also reduces reliance on costly and more powerful commercial LLMs. Chatty-Gen eliminates upfront processing of the entire KG using efficient query-based retrieval to find representative subgraphs based on the dialogue context. Our experiments with several real and large KGs demonstrate that Chatty-Gen significantly outperforms state-of-the-art systems and ensures consistent model and system performance across multiple LLMs of diverse capabilities, such as GPT-4o, Gemini 1.5, Llama 3, and Mistral.
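
The stage-wise validation described above can be illustrated with a minimal sketch. It assumes that each stage is a callable paired with an assertion rule, and that a failed assertion triggers a retry of that stage only, not a restart of the whole pipeline. The names `Stage`, `StageFailed`, `run_pipeline`, and `max_retries`, as well as the toy question stage, are hypothetical and do not come from Chatty-Gen; a real stage would call an LLM.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]          # maps the previous stage's output to a new output
    check: Callable[[Any, Any], bool]  # assertion rule: (input, output) -> is it valid?


class StageFailed(RuntimeError):
    """Raised when a stage keeps violating its assertion rule."""


def run_pipeline(stages: List[Stage], context: Any, max_retries: int = 2) -> Any:
    data = context
    for stage in stages:
        for _ in range(max_retries + 1):
            out = stage.run(data)
            if stage.check(data, out):
                data = out  # accept the validated intermediate result
                break
        else:
            # Retries exhausted: stop here; earlier validated results are not redone.
            raise StageFailed(f"{stage.name} failed after {max_retries + 1} attempts")
    return data


def make_questions(triples):
    # Stand-in for an LLM call: one question per (subject, predicate, object) triple.
    return [f"What is the {p} of {s}?" for s, p, _ in triples]


def questions_valid(triples, questions):
    # Example assertion rule: one well-formed question per input triple.
    return len(questions) == len(triples) and all(q.endswith("?") for q in questions)


stages = [Stage("questions", make_questions, questions_valid)]
result = run_pipeline(stages, [("Celine Dion", "birth place", "<object>")])
```

Keeping the assertion rule separate from the generator means the same rule can be applied regardless of which LLM produces the output.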

Paper Structure (22 sections, 6 equations, 6 figures, 5 tables, 3 algorithms)

Figures (6)

  • Figure 1: An illustration of the steps required to generate a dialogue from the entity "Celine Dion" in the DBpedia KG. The subgraph of Celine Dion serves as the dialogue context.
  • Figure 2: Chatty-Gen's architecture includes two main phases. A) Dialogue Context Extraction uses a node-type retrieval method to predict entities' textual representations from the KG and extracts seed entities with their surrounding subgraphs as dialogue context. B) Dialogue Generation employs three LLM-based steps: generating self-contained questions, formulating SPARQL queries from questions and triples, and organizing them into a coherent dialogue.
  • Figure 3: Examples of questions generated from DBLP by Maestro and Chatty-Gen. Chatty-Gen predicts more accurate entity labels, which helps LLMs generate human-like questions.
  • Figure 4: Comparison of the diversity of question types generated by Maestro and Chatty-Gen for the three KGs.
  • Figure 5: Comparison of the node-type distribution in the KG, as achieved by Chatty-Gen and Maestro for the selected seed entities. For the KG and Chatty-Gen, 'x' denotes a rare node type, whereas for Maestro, it indicates no selected seed entities.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3