FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark

Heegyu Kim, Taeyang Jeon, Seunghwan Choi, Seungtaek Choi, Hyunsouk Cho

TL;DR

FLEX (False-Less EXecution) is a novel approach to evaluating text-to-SQL systems that uses large language models (LLMs) to emulate human expert-level evaluation of SQL queries, potentially reshaping the understanding of state-of-the-art performance in this field.

Abstract

Text-to-SQL systems have become crucial for translating natural language into SQL queries in various industries, enabling non-technical users to perform complex data operations. The need for accurate evaluation methods has increased as these systems have grown more sophisticated. However, Execution Accuracy (EX), the most prevalent evaluation metric, still produces many false positives and false negatives. This paper therefore introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems that uses large language models (LLMs) to emulate human expert-level evaluation of SQL queries. By drawing on comprehensive context and sophisticated criteria, the metric improves agreement with human experts (from 62 to 87.04 in Cohen's kappa). Our extensive experiments yield several key insights: (1) models' performance increases by over 2.6 points on average under FLEX, substantially affecting rankings on the Spider and BIRD benchmarks; (2) the underestimation of models in EX primarily stems from annotation quality issues; and (3) model performance on particularly challenging questions tends to be overestimated. This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
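
To make the false positives and false negatives of EX concrete, the following is a minimal sketch of an execution-based comparison. It is a simplified stand-in, not the official Spider or BIRD evaluation script: the `execution_match` helper, the `singer` table, and the example questions and queries are all hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Simplified EX-style check (an assumption, not an official script):
    both queries must return the same multiset of rows."""
    pred_rows = conn.execute(predicted_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


# Hypothetical toy database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, country TEXT);
    INSERT INTO singer VALUES
        (1, 'Ana', 30, 'France'),
        (2, 'Bo',  41, 'Japan'),
        (3, 'Cy',  41, 'France');
    """
)

# Question: "How many singers are older than 35?"
# The predicted query answers a different question, yet both queries
# happen to return 2 on this database, so the check passes: a false positive.
false_positive = execution_match(
    conn,
    predicted_sql="SELECT COUNT(*) FROM singer WHERE country = 'France'",
    gold_sql="SELECT COUNT(*) FROM singer WHERE age > 35",
)

# Question: "List each singer's name and age."
# The predicted query returns the right information with the columns swapped,
# so the row tuples differ and the check fails: a false negative.
false_negative_passes = execution_match(
    conn,
    predicted_sql="SELECT age, name FROM singer",
    gold_sql="SELECT name, age FROM singer",
)

print("wrong query accepted:", false_positive)
print("acceptable query accepted:", false_negative_passes)
```

In this sketch the semantically wrong query is accepted and the acceptable one is rejected, which is the kind of disagreement with expert judgment that the abstract attributes to EX.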

Paper Structure (36 sections, 2 equations, 16 figures, 18 tables)

Figures (16)

  • Figure 1: Performance comparison of EX vs. FLEX metrics on Spider benchmark. The red identity line shows an equivalent score.
  • Figure 2: Compared to conventional EM and EX, FLEX evaluates semantic equivalence between question and query based on holistic, contextual information.
  • Figure 3: Agreements between human evaluation and FLEX across LLM models over time. The red line shows EX metric agreement. Dots represent other LLMs, illustrating lower agreement than previous SOTA. Details are illustrated in Fig. \ref{fig:flex_pub_kappa_full}.
  • Figure 4: Average model performances and error ratios across different model types.
  • Figure 5: Categorized result of FN ratios in top 10 models. Struct denotes an acceptable output structure variation, Value denotes a different representation of value, GT denotes incorrect ground truth, and Multiple Ans denotes multiple answers available.
  • ...and 11 more figures