Are LLMs Overkill for Databases?: A Study on the Finiteness of SQL

Yue Li, David Mimno, Unso Eun Seo Jo

Abstract

Translating natural language to SQL for data retrieval has become more accessible thanks to code generation LLMs. But how hard is it to generate SQL code? While databases can become unbounded in complexity, the complexity of queries is bounded by real-life utility and human needs. With a sample of 376 databases, we show that SQL queries, as translations of natural language questions, are finite in practical complexity. There is no clear monotonic relationship between increases in database table count and increases in complexity of SQL queries. In their template forms, SQL queries follow a power-law-like frequency distribution in which 70% of our tested queries can be covered with just 13% of all template types, indicating that the large majority of SQL queries are predictable. This suggests that while LLMs for code generation can be useful, in the domain of database access they may be operating in a narrow, highly formulaic space where templates could be safer, cheaper, and more auditable.
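To make the coverage claim concrete, the following is a minimal sketch of how a "fraction of templates needed to cover a share of queries" figure could be computed. It is illustrative only: the `to_template` masking rule (replace string and numeric literals, normalise case and whitespace) and both function names are assumptions made here, not the hard or soft template definitions used in the paper.

```python
import re
from collections import Counter


def to_template(sql: str) -> str:
    """Hypothetical templating rule: mask literals, normalise case and spacing."""
    masked = re.sub(r"'[^']*'", "<STR>", sql)               # string literals
    masked = re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", masked)    # numeric literals
    return " ".join(masked.upper().split())


def coverage_fraction(queries, target=0.70):
    """Smallest fraction of distinct templates whose queries reach `target` coverage."""
    counts = Counter(to_template(q) for q in queries)
    total = sum(counts.values())
    covered = 0
    # Walk templates from most to least frequent, accumulating covered queries.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank / len(counts)
    return 1.0  # empty input or unreachable target
```

Under a power-law-like distribution, a few high-frequency templates account for most queries, so `coverage_fraction` returns a small value; under a flat distribution it would be close to `target` itself.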

Paper Structure

This paper contains 31 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Frequency curve of soft templates.
  • Figure 2: Moving average values of six proxies (window size = 15).
  • Figure 3: Log-log plots for hard and soft templates.
  • Figure 4: Frequency curves for hard and soft templates.
  • Figure 5: Individual Plots for Six Proxies. Each point represents the average value of a proxy metric for a specific table count. The dashed line denotes the moving average of the proxy values with a window size of 15.
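
The moving average mentioned in the captions for Figures 2 and 5 can be sketched as follows. The captions give only the window size (15); the trailing alignment and the shorter windows at the start of the series are assumptions made here for illustration, and `moving_average` is a hypothetical helper name.

```python
def moving_average(values, window=15):
    """Trailing moving average; early points average over the values seen so far."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

Here `values` would hold the per-table-count averages of one proxy metric, ordered by table count.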