Auto-Cypher: Improving LLMs on Cypher generation via LLM-supervised generation-verification framework

Aman Tiwari, Shiva Krishna Reddy Malay, Vikas Yadav, Masoud Hashemi, Sathwik Tejaswi Madhusudhan

TL;DR

Auto-Cypher addresses the gap in Text2Cypher by introducing SynthCypher, a fully automated LLM-supervised data-generation and validation pipeline that uses LLMs as database fillers to ensure executable Cypher queries. The pipeline yields a large, diverse synthetic dataset across 109 query types and 700 domains, which supports supervised fine-tuning of open-source LLMs, and the work adapts SPIDER into a SPIDER-Cypher benchmark for evaluation. Fine-tuning open models on SynthCypher achieves substantial performance gains (up to 40 percentage points for 7B/8B models and ~30 points on SPIDER-Cypher), demonstrating the value of guided, executable data generation for graph NL-to-Cypher tasks. The work provides practical benchmarks and a scalable approach to improve Cypher generation in graph databases like Neo4j, with implications for broader NL-to-graph-query systems.

Abstract

Graph databases like Neo4j are gaining popularity over traditional relational databases for modeling and querying complex, interconnected data. While translating natural language into SQL queries is well researched, generating Cypher queries for Neo4j remains relatively underexplored. In this work, we present an automated, LLM-supervised pipeline to generate high-quality synthetic data for Text2Cypher. Our Cypher data generation pipeline introduces LLM-As-Database-Filler, a novel strategy for ensuring Cypher query correctness, which results in high-quality generations. Using our pipeline, we generate SynthCypher, a high-quality Text2Cypher dataset containing 29.8k instances across various domains and queries of varying complexity. Training open-source LLMs like LLaMa-3.1-8B, Mistral-7B, and QWEN-7B on SynthCypher results in performance gains of up to 40% on the Text2Cypher test split and 30% on the SPIDER benchmark adapted for graph databases.
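The abstract does not spell out how generated queries are checked, so the following is only a minimal sketch of one plausible execution-based validation step. It assumes that each synthetic example carries a ground-truth answer and that a candidate Cypher query is kept only if running it against the populated database reproduces that answer. The names `Candidate`, `validate_candidates`, and `run_query` are hypothetical and do not come from the paper; `run_query` stands in for whatever executes Cypher against a database.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Candidate:
    """One synthetic example (hypothetical structure, not from the paper)."""
    question: str
    cypher: str
    expected: List[Any]  # ground-truth rows the database was filled to produce


def _normalize(rows: Iterable[Any]) -> List[str]:
    # Compare results without regard to row order.
    return sorted(repr(row) for row in rows)


def validate_candidates(
    candidates: Iterable[Candidate],
    run_query: Callable[[str], Iterable[Any]],
) -> List[Candidate]:
    """Keep candidates whose query executes and matches the expected answer."""
    kept = []
    for cand in candidates:
        try:
            rows = list(run_query(cand.cypher))
        except Exception:
            # A query that fails to execute is discarded (assumed policy).
            continue
        if _normalize(rows) == _normalize(cand.expected):
            kept.append(cand)
    return kept
```

In use, `run_query` could wrap a real database session; for a quick check, any function that maps a query string to rows is enough. Passing the executor in as a callable keeps the filtering logic independent of any particular driver.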

Paper Structure

This paper contains 17 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Example of an input natural language query converted to a Cypher query for the given schema. The top example shows an easy retrieval question, while the bottom example shows a complex Multi-Attribute and Multi-Relationship query.
  • Figure 2: Overview of the SynthCypher data generation pipeline, illustrating domain and schema creation, query and ground truth generation, database population, Cypher query generation, and validation steps.
  • Figure 3: Evaluation on the SynthCypher and SPIDER test splits of Llama3.1-8B fine-tuned on equal-sized training sets of down-sampled SynthCypher (ours) data and Neo4j Text2Cypher data.
  • Figure 4: Skeleton schema generation step using Mixtral-8x22B
  • Figure 5: Complete schema generation step using Mixtral-8x22B
  • ...and 9 more figures