NL2KQL: From Natural Language to Kusto Query

Xinye Tang, Amir H. Abdi, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, Ye Xing

TL;DR

NL2KQL is an end-to-end framework that uses large language models to translate natural language queries into Kusto Query Language (KQL). It introduces a modular pipeline (Semantic Data Catalog, Schema Refiner, Synthetic Few-shot Database, Few-shot Selector, Prompt Builder, and Query Refiner), supported by a synthetic NLQ-KQL dataset and an evaluation regime with offline (parsing-based) and online (execution-based) metrics. The authors validate NL2KQL on Defender and Sentinel databases and run ablation studies that show the importance of schema grounding, targeted few-shots, and post-generation refinement for robust KQL generation. The work releases its benchmark publicly and shows that carefully designed prompting, grounding, and validation significantly improve NL-to-KQL translation, with practical benefits for large-scale, semi-structured data analytics.

Abstract

Data is growing rapidly in volume and complexity. Proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data such as logs, telemetry, and time series on big data analytics platforms. This paper introduces NL2KQL, an innovative framework that uses large language models (LLMs) to convert natural language queries (NLQs) to KQL queries. The proposed NL2KQL framework includes several key components: the Schema Refiner, which narrows down the schema to its most pertinent elements; the Few-shot Selector, which dynamically selects relevant examples from a few-shot dataset; and the Query Refiner, which repairs syntactic and semantic errors in KQL queries. Additionally, this study outlines a method for generating large datasets of synthetic NLQ-KQL pairs that are valid within a specific database context. To validate NL2KQL's performance, we utilize an array of online (based on query execution) and offline (based on query parsing) metrics. Through ablation studies, the significance of each framework component is examined, and the datasets used for benchmarking are made publicly available. This work is the first of its kind and is compared with available baselines to demonstrate its effectiveness.
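
As a rough illustration of the Few-shot Selector and prompt-building ideas named in the abstract, the Python sketch below picks the stored NLQ-KQL pairs most similar to an incoming question and assembles them into a prompt. It is not the authors' implementation. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the example pairs, table and column names, schema string, and prompt layout are all assumptions made for illustration.

```python
import math
from collections import Counter

# Illustrative NLQ-KQL pairs; table and column names are assumptions.
EXAMPLES = [
    ("show failed logons in the last day",
     "DeviceLogonEvents | where ActionType == 'LogonFailed' | where Timestamp > ago(1d)"),
    ("list emails with malware attachments",
     "EmailEvents | where ThreatTypes has 'Malware'"),
    ("count processes started by powershell",
     "DeviceProcessEvents | where InitiatingProcessFileName == 'powershell.exe' | count"),
]

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_few_shots(nlq, examples, k=2):
    """Return the k pairs whose NLQ is most similar to the input question."""
    query_vec = embed(nlq)
    ranked = sorted(examples, key=lambda ex: cosine(query_vec, embed(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(nlq, schema, shots):
    """Assemble a prompt from a (refined) schema, selected shots, and the question."""
    lines = ["Schema:", schema, "", "Examples:"]
    for shot_nlq, shot_kql in shots:
        lines += [f"Question: {shot_nlq}", f"KQL: {shot_kql}", ""]
    lines += [f"Question: {nlq}", "KQL:"]
    return "\n".join(lines)

question = "count failed logons per device"
shots = select_few_shots(question, EXAMPLES, k=2)
prompt = build_prompt(question, "DeviceLogonEvents(Timestamp, DeviceName, ActionType)", shots)
```

In a fuller pipeline, the prompt would go to an LLM and the generated query would then pass through a repair step such as the Query Refiner; both are left out here because the abstract does not specify how they work.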

Paper Structure

This paper contains 34 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of NL2KQL inference pipeline.
  • Figure 2: Synthetic few-shot generation and round-trip validation process.
  • Figure 3: Overview of embedding stores.