FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark

Heegyu Kim; Taeyang Jeon; Seunghwan Choi; Seungtaek Choi; Hyunsouk Cho

TL;DR

FLEX (False-Less EXecution) is a novel approach to evaluating text-to-SQL systems that uses large language models (LLMs) to emulate human expert-level evaluation of SQL queries, potentially reshaping the understanding of state-of-the-art performance in this field.

Abstract

Text-to-SQL systems have become crucial for translating natural language into SQL queries in various industries, enabling non-technical users to perform complex data operations. As these systems have grown more sophisticated, the need for accurate evaluation methods has increased. However, Execution Accuracy (EX), the most prevalent evaluation metric, still produces many false positives and false negatives. This paper therefore introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems that uses large language models (LLMs) to emulate human expert-level evaluation of SQL queries. By supplying comprehensive context and sophisticated criteria, our metric improves agreement with human experts (from 62 to 87.04 in Cohen's kappa). Our extensive experiments yield several key insights: (1) Model performance increases by over 2.6 points on average under FLEX, substantially affecting rankings on the Spider and BIRD benchmarks; (2) The underestimation of models in EX primarily stems from annotation quality issues; and (3) Model performance on particularly challenging questions tends to be overestimated. This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
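To make the false positives and false negatives of EX concrete, here is a minimal sketch of an execution-match check using Python's built-in sqlite3 module. It is an illustration under stated assumptions, not the evaluation code used by Spider, BIRD, or the paper: the `singer` table, its rows, the question, and the helper `execution_match` are all hypothetical, and the comparison (order-insensitive equality of result rows) is a simplified stand-in for EX.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Simplified EX (assumption): rows must match as order-insensitive multisets."""
    pred_rows = conn.execute(predicted_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)


# Hypothetical toy database, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, country TEXT);
    INSERT INTO singer VALUES
        (1, 'Ana', 30, 'France'),
        (2, 'Ben', 45, 'France'),
        (3, 'Cho', 52, 'Korea');
""")

# Hypothetical question: "Who is the oldest singer?"
gold = "SELECT name FROM singer ORDER BY age DESC LIMIT 1"

# Arguably acceptable answer that also reports the age.
# The extra column changes the result rows, so the check rejects it.
fn_pred = "SELECT name, age FROM singer ORDER BY age DESC LIMIT 1"

# Wrong logic (filters by country instead of age) that happens to
# return the same row on this particular database.
fp_pred = "SELECT name FROM singer WHERE country != 'France'"

false_negative = not execution_match(conn, fn_pred, gold)
false_positive = execution_match(conn, fp_pred, gold)

print("rejected although reasonable:", false_negative)
print("accepted although wrong:", false_positive)
```

Both cases arise because the check sees only result rows, not the question or the intent behind the query. That is the gap an evaluator with access to the full context is meant to close.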

Paper Structure (36 sections, 2 equations, 16 figures, 18 tables)

Introduction
Related Works
Preliminaries
Analyzing the Limitations of Current Text-to-SQL Evaluation Methods
What Types of Errors Occur in Text-to-SQL Evaluation?
How Closely Does EX Align with Human Expert Evaluations?
Can LLMs Replace EX in Evaluating Text-to-SQL Systems?
Our Proposed Metric: FLEX
Evaluation Process
Optimal Context ($\mathbb{C}_{FLEX}$)
Experiments
Does FLEX Outperform Other Metrics?
What Factors are Beneficial?
Leaderboard Re-evaluation
Experiment Setup
...and 21 more sections

Figures (16)

Figure 1: Performance comparison of EX vs. FLEX metrics on Spider benchmark. The red identity line shows an equivalent score.
Figure 2: Compared to conventional EM and EX, FLEX evaluates semantic equivalence between question and query based on holistic, contextual information.
Figure 3: Agreements between human evaluation and FLEX across LLM models over time. The red line shows EX metric agreement. Dots represent other LLMs, illustrating lower agreement than previous SOTA. Details are illustrated in Fig. \ref{fig:flex_pub_kappa_full}.
Figure 4: Average model performances and error ratios across different model types.
Figure 5: Categorized result of FN ratios in top 10 models. Struct denotes an acceptable output structure variation, Value denotes a different representation of value, GT denotes incorrect ground truth, and Multiple Ans denotes multiple answers available.
...and 11 more figures
