High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services
Shivasankari Kannan, Yeounoh Chung, Amita Gondi, Tristan Swadell, Fatma Ozcan
TL;DR
The paper addresses the challenge of generating high-fidelity, semantically coherent test data for complex Google SQL code generation services under restricted production data. It introduces a Gemini-based data-generation pipeline comprising Pre-processor, Context extractor, Prompt builder, Post-processor, and Data validator to produce structurally compliant data for nested schemas like protobuf and GoogleSQL. Empirical results show substantial improvements in test coverage for NL2SQL and SQL Code Assistant, along with better semantic alignment to test queries via LLM-as-a-judge and constrained validation, outperforming production sampling in many scenarios. The approach is dialect-agnostic, scalable, and cost-effective relative to manual data creation, with future work focusing on reducing hallucinations and extending semantic coverage for complex time- and constraint-based queries.
Abstract
The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low-fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that includes columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that lead to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of an LLM (\textit{gemini}) based test data generation for industrial SQL code generation services where generating high-fidelity test data is essential due to the frequent unavailability and inaccessibility of production datasets for testing.
