LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings

Lifu Tu, Rongguang Wang, Tao Sheng, Sujjith Ravi, Dan Roth

Abstract

Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.

Paper Structure (36 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Results on R-NL2SQL-Tradition (adapted from Spider 1; Yu et al., 2018). LLMs generally maintain strong performance under several perturbations; however, notable performance degradation is still observed.
  • Figure 2: Results for R-NL2SQL-Agentic (adapted from Spider 2; Lei et al., 2024). We evaluate with the Spider-Agent framework (Lei et al., 2024), which follows ReAct (Yao et al., 2023), and substitute different LLMs. Earlier LLMs (e.g., GPT-4.1) achieved very low results. Traditional pipelines degrade more under surface-level noise, whereas agentic setups are more challenged by linguistic variation.
  • Figure 3: Execution accuracy of "one-perturbation" and "all-perturbation" in the agentic setting, confirming that linguistic variations cause larger performance differences.
  • Figure 4: Execution accuracy when extra databases are added in the agentic setting.
  • Figure 5: Execution accuracy of GPT-4.1 in the agentic setting. The low accuracy indicates that earlier LLMs have much weaker agentic capabilities.
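
Several of the captions above report execution accuracy. The evaluation script itself is not described in this section, so the following is only a minimal sketch of how such a metric is commonly computed, assuming an order-insensitive comparison of result rows and treating a failing prediction as a miss. The helper names (`run_query`, `execution_accuracy`, `build_toy_db`) and the toy SQLite table are hypothetical and are not part of the benchmark.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows as a multiset, or None if it fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs whose result multisets match.

    Assumption: row order is ignored and duplicate rows are counted.
    """
    hits = 0
    for predicted, gold in pairs:
        gold_rows = run_query(conn, gold)
        pred_rows = run_query(conn, predicted)
        if gold_rows is not None and pred_rows == gold_rows:
            hits += 1
    return hits / len(pairs) if pairs else 0.0


def build_toy_db():
    """Create a small in-memory database used only for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 30), ("Bo", 41), ("Cy", 25)],
    )
    return conn


if __name__ == "__main__":
    conn = build_toy_db()
    pairs = [
        # Different SQL text, same result rows: counted as correct.
        ("SELECT name FROM singer WHERE age > 28",
         "SELECT name FROM singer WHERE age >= 30"),
        # Misspelled column: the prediction fails to execute, counted as a miss.
        ("SELECT nme FROM singer", "SELECT name FROM singer"),
    ]
    print(execution_accuracy(conn, pairs))
```

Under this reading, a perturbed question only lowers the score when the SQL generated for it returns different rows from the gold query, not merely when the SQL text changes.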