Table of Contents
Fetching ...

High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services

Shivasankari Kannan, Yeounoh Chung, Amita Gondi, Tristan Swadell, Fatma Ozcan

TL;DR

The paper addresses the challenge of generating high-fidelity, semantically coherent test data for complex Google SQL code generation services under restricted production data. It introduces a Gemini-based data-generation pipeline comprising Pre-processor, Context extractor, Prompt builder, Post-processor, and Data validator to produce structurally compliant data for nested schemas like protobuf and GoogleSQL. Empirical results show substantial improvements in test coverage for NL2SQL and SQL Code Assistant, along with better semantic alignment to test queries via LLM-as-a-judge and constrained validation, outperforming production sampling in many scenarios. The approach is dialect-agnostic, scalable, and cost-effective relative to manual data creation, with future work focusing on reducing hallucinations and extending semantic coverage for complex time- and constraint-based queries.

Abstract

The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low-fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that includes columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that lead to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of an LLM (\textit{gemini}) based test data generation for industrial SQL code generation services where generating high-fidelity test data is essential due to the frequent unavailability and inaccessibility of production datasets for testing.

High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services

TL;DR

The paper addresses the challenge of generating high-fidelity, semantically coherent test data for complex Google SQL code generation services under restricted production data. It introduces a Gemini-based data-generation pipeline comprising Pre-processor, Context extractor, Prompt builder, Post-processor, and Data validator to produce structurally compliant data for nested schemas like protobuf and GoogleSQL. Empirical results show substantial improvements in test coverage for NL2SQL and SQL Code Assistant, along with better semantic alignment to test queries via LLM-as-a-judge and constrained validation, outperforming production sampling in many scenarios. The approach is dialect-agnostic, scalable, and cost-effective relative to manual data creation, with future work focusing on reducing hallucinations and extending semantic coverage for complex time- and constraint-based queries.

Abstract

The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low-fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that includes columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that lead to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of an LLM (\textit{gemini}) based test data generation for industrial SQL code generation services where generating high-fidelity test data is essential due to the frequent unavailability and inaccessibility of production datasets for testing.

Paper Structure

This paper contains 23 sections, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Simple selection query accessing deeply nested fields on a fake table, currency and amount.
  • Figure 2: A natural language question and the golden SQL query with a join within the subquery.
  • Figure 3: The proposed LLM (gemini) based test data generation process. User provides context map (questions, schema, additional criteria) and test body (SQL query) as input. This data generation request triggers a series of data processing steps and modules. The goal is to generate syntactically correct and semantically relevant test data for SQL code generation services, like NL2SQL. We leverage the latest LLM capabilities to do this efficiently at scale.
  • Figure 4: Pre-processor parses out sub-queries (middle) from the original SQL query (top) for seeding and independent and dependent data generation. Join information is extracted to highlight the table and column association for the join (bottom).
  • Figure 5: Data generation prompt template with system and general instructions. Custom generation seeding instructions can be passed as part of data_generation_signals and user_instruction.
  • ...and 7 more figures