Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]

Songlin Lyu, Lujie Ban, Zihang Wu, Tianqi Luo, Jirong Liu, Chenhao Ma, Yuyu Luo, Nan Tang, Shipeng Qi, Heng Lin, Yongchao Liu, Chuntao Hong

TL;DR

This work tackles the lack of standardized benchmarks for Text-to-GQL by introducing Text2GQL-Bench, a large, multi-GQL, multi-domain benchmark built via a scalable construction framework and evaluated with a comprehensive set of metrics. It combines data from existing graph and relational benchmarks with synthesized general-domain graphs, employs a Graph-IR to translate and generate queries across dialects, and uses hierarchical question generation to cover diverse user intents. Empirical results reveal a pronounced ISO-GQL dialect gap that improves with few-shot prompts and domain-aligned supervision, while model scale alone yields limited gains. The framework offers an extensible platform for evaluating and advancing Text-to-GQL research across additional dialects and richer graph workloads.

Abstract

Graph models are fundamental to data analysis in domains rich with complex relationships. Text-to-Graph-Query-Language (Text-to-GQL) systems act as translators, converting natural language into executable graph queries. This capability allows Large Language Models (LLMs) to directly analyze and manipulate graph data, positioning them as powerful agent infrastructure for Graph Database Management Systems (GDBMS). Despite recent progress, existing datasets are often limited in domain coverage, supported graph query languages, or evaluation scope. The advancement of Text-to-GQL systems is hindered by the lack of high-quality benchmark datasets and evaluation methods that systematically compare model capabilities across graph query languages and domains. In this work, we present Text2GQL-Bench, a unified Text-to-GQL benchmark designed to address these limitations. Text2GQL-Bench couples a multi-GQL dataset of 178,184 (Question, Query) pairs spanning 13 domains with a scalable construction framework that generates datasets across domains, question abstraction levels, and GQLs from heterogeneous resources. To support comprehensive assessment, we introduce an evaluation method that goes beyond a single end-to-end metric by jointly reporting grammatical validity, similarity, semantic alignment, and execution accuracy. Our evaluation uncovers a stark dialect gap in ISO-GQL generation: even strong LLMs achieve at most 4% execution accuracy (EX) in zero-shot settings. A fixed 3-shot prompt raises accuracy to around 50%, but grammatical validity remains below 70%. Moreover, a fine-tuned 8B open-weight model reaches 45.1% EX and 90.8% grammatical validity, demonstrating that most of the performance jump is unlocked by exposure to sufficient ISO-GQL examples.
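To make the execution accuracy (EX) metric concrete, the following is a minimal sketch of how such a score could be computed. It is not the benchmark's implementation: the order-insensitive multiset comparison, the rule that a query which fails to execute counts as incorrect, and the names `results_match`, `execution_accuracy`, and `run_query` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Row = Tuple  # one result row, assumed to contain hashable values


def results_match(pred_rows: Iterable[Row], gold_rows: Iterable[Row]) -> bool:
    """Compare two result sets as multisets, ignoring row order (an assumption)."""
    return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))


def execution_accuracy(
    pairs: Iterable[Tuple[str, str]],
    run_query: Callable[[str], List[Row]],
) -> float:
    """Fraction of (predicted, gold) query pairs whose executed results match.

    `run_query` is a hypothetical executor that returns rows or raises on
    an invalid query; a predicted query that raises is counted as incorrect.
    """
    total = 0
    correct = 0
    for predicted, gold in pairs:
        total += 1
        try:
            pred_rows = run_query(predicted)
        except Exception:
            continue  # not executable, so it cannot be correct
        if results_match(pred_rows, run_query(gold)):
            correct += 1
    return correct / total if total else 0.0
```

In practice `run_query` would wrap a driver call against the target GDBMS. Reporting grammatical validity separately, as the abstract describes, helps distinguish queries that fail to parse from queries that run but return the wrong result.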

Paper Structure (29 sections, 2 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: An example of Text-to-GQL task on a financial graph. With the input graph schema and question, the task aims to translate the user intent into a graph query.
  • Figure 2: Example showing the difference between SQL and GQL. SQL needs complex join operations and does not natively support variable-length queries, whereas GQL can express them naturally.
  • Figure 3: Framework Overview. Our dataset construction framework builds a Text-to-GQL dataset from heterogeneous data sources in 4 stages: (i) Schema Translation & Generation: converting the existing dataset schemas into graph schemas, and generating schemas for general-domain subsets; (ii) Data Conversion & Generation: converting existing data or synthesizing data on given graph schemas to create executable graph database instances on the target GDBMS; (iii) Query Translation & Generation: generating executable GQL queries via automated translation from existing Cypher or SQL queries, and via LLM-based synthesis; (iv) Hierarchical Question Generation: annotating GQL queries at different question abstraction levels.
  • Figure 4: Data Generation Pipeline. With the SchemaGraph intermediate representation and data distribution setting, the LLM-based logic generator can provide an executable Python script to generate data with execution-based verification. The data will be stored in .csv or .parquet files.
  • Figure 6: Data Domain Distribution w.r.t. size.
  • ...and 3 more figures
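
The SQL versus GQL contrast described in the Figure 2 caption can be illustrated with a small sketch. The table `transfer`, the labels `Account` and `TRANSFER`, and the helper `reachable` are hypothetical and not taken from the paper. On the relational side, a bounded variable-length traversal is emulated with a recursive common table expression in SQLite; the Cypher-style pattern is shown only as a string for comparison and is not executed.

```python
import sqlite3

# Hypothetical edge table: each row is one transfer between two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfer (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO transfer VALUES (?, ?)",
    [("A", "B"), ("B", "C"), ("C", "D")],
)

# Relational workaround: a recursive CTE that walks edges up to a hop limit.
SQL_REACHABLE = """
WITH RECURSIVE reach(node, hops) AS (
    SELECT dst, 1 FROM transfer WHERE src = ?
    UNION
    SELECT t.dst, r.hops + 1
    FROM reach AS r
    JOIN transfer AS t ON t.src = r.node
    WHERE r.hops < ?
)
SELECT DISTINCT node FROM reach ORDER BY node
"""

# Graph-pattern counterpart (illustrative Cypher-style syntax, not executed here).
CYPHER_STYLE = (
    "MATCH (a:Account {id: $start})-[:TRANSFER*1..3]->(b:Account) "
    "RETURN DISTINCT b.id"
)


def reachable(start: str, max_hops: int) -> list:
    """Return accounts reachable from `start` within `max_hops` transfers."""
    return [row[0] for row in conn.execute(SQL_REACHABLE, (start, max_hops))]
```

The graph pattern states the hop range inline (`*1..3`), while the relational version has to spell out the recursion and the join explicitly.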