Table of Contents
Fetching ...

Querying Databases with Function Calling

Connor Shorten, Charles Pierse, Thomas Benjamin Smith, Karel D'Oosterlinck, Tuana Celik, Erika Cardenas, Leonie Monigatti, Mohd Shukri Hasan, Edward Schmuhl, Daniel Williams, Aravind Kesiraju, Bob van Luijt

TL;DR

This work investigates enabling natural-language querying of databases through a unified Function Calling interface. It decouples query operators from SQL, uniting search, filters, aggregations, and grouping into a single tool (query_database) and evaluates it with DBGorilla, a synthetic benchmark built on Gorilla that covers five schemas and 315 queries. Across eight LLMs from five families, top performers (Claude 3.5 Sonnet, GPT-4o variants) achieve high Exact Match and AST scores, with robust handling of boolean operators but weaker performance on text filters. The study also explores ablations (rationale, parallel calls, per-collection tooling, structured outputs) and analyzes maintenance costs, providing a foundation for scalable, language-enabled database querying in Compound AI Systems and offering open-source resources for replication and extension.

Abstract

The capabilities of Large Language Models (LLMs) are rapidly accelerating largely thanks to their integration with external tools. Querying databases is among the most effective of these integrations, enabling LLMs to access private or continually updating data. While Function Calling is the most common method for interfacing external tools to LLMs, its application to database querying as a tool has been underexplored. We propose a tool definition for database querying that unifies accessing data with search queries, filters, or a combination both, as well as transforming results with aggregation and groupby operators. To evaluate its effectiveness, we conduct a study with 8 LLMs spanning 5 model families. We present a novel pipeline adapting the Gorilla LLM framework to create synthetic database schemas and queries. We primarily evaluate the models with the Exact Match of predicted and ground truth query APIs. Among the models tested, Claude 3.5 Sonnet achieves the highest performance with an Exact Match score of 74.3%, followed by GPT-4o mini at 73.7%, and GPT-4o at 71.8%. We further breakdown these results per API component utilized and across synthetic use cases. We find that LLMs are highly effective at utilizing operators on boolean properties, but struggle with text property filters. Across use cases we find robust results with the higher performing models such as GPT-4o, but significant performance variance across use cases from lower performing models. We additionally conduct ablation studies exploring the impact of parallel tool calling, adding a rationale as an argument of the tool call, using a separate tool per database collection, and tool calling with structured outputs. Our findings demonstrate the effectiveness of enabling LLMs to query databases with Function Calling. We have open-sourced our experimental code and results at github.com/weaviate/gorilla.

Querying Databases with Function Calling

TL;DR

This work investigates enabling natural-language querying of databases through a unified Function Calling interface. It decouples query operators from SQL, uniting search, filters, aggregations, and grouping into a single tool (query_database) and evaluates it with DBGorilla, a synthetic benchmark built on Gorilla that covers five schemas and 315 queries. Across eight LLMs from five families, top performers (Claude 3.5 Sonnet, GPT-4o variants) achieve high Exact Match and AST scores, with robust handling of boolean operators but weaker performance on text filters. The study also explores ablations (rationale, parallel calls, per-collection tooling, structured outputs) and analyzes maintenance costs, providing a foundation for scalable, language-enabled database querying in Compound AI Systems and offering open-source resources for replication and extension.

Abstract

The capabilities of Large Language Models (LLMs) are rapidly accelerating largely thanks to their integration with external tools. Querying databases is among the most effective of these integrations, enabling LLMs to access private or continually updating data. While Function Calling is the most common method for interfacing external tools to LLMs, its application to database querying as a tool has been underexplored. We propose a tool definition for database querying that unifies accessing data with search queries, filters, or a combination both, as well as transforming results with aggregation and groupby operators. To evaluate its effectiveness, we conduct a study with 8 LLMs spanning 5 model families. We present a novel pipeline adapting the Gorilla LLM framework to create synthetic database schemas and queries. We primarily evaluate the models with the Exact Match of predicted and ground truth query APIs. Among the models tested, Claude 3.5 Sonnet achieves the highest performance with an Exact Match score of 74.3%, followed by GPT-4o mini at 73.7%, and GPT-4o at 71.8%. We further breakdown these results per API component utilized and across synthetic use cases. We find that LLMs are highly effective at utilizing operators on boolean properties, but struggle with text property filters. Across use cases we find robust results with the higher performing models such as GPT-4o, but significant performance variance across use cases from lower performing models. We additionally conduct ablation studies exploring the impact of parallel tool calling, adding a rationale as an argument of the tool call, using a separate tool per database collection, and tool calling with structured outputs. Our findings demonstrate the effectiveness of enabling LLMs to query databases with Function Calling. We have open-sourced our experimental code and results at github.com/weaviate/gorilla.

Paper Structure

This paper contains 29 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: DBGorilla Leaderboard results (last updated January 1st, 2025). The Exact Match and AST Score columns report the respective averages across all tested queries. Query scores are further separated into categories of "Simple", "Moderate", and "Complex" according to how many arguments are used in the ground truth function call with 1, 2, and 3 or more, respectively. Collection routing reports the percentage the predicted query is routed to the correct database collection.
  • Figure 2: An illustration of a natural language command, How many menu items are priced under 20?, translated to Function Calling arguments for database querying.
  • Figure 3: Examples of queries in the BIRD Text-to-SQL benchmark birdsql. We visualize these to help readers gain a better understanding of how Text-to-SQL is currently studied and how BIRD differs from DBGorilla.
  • Figure 4: An illustration of the Function Calling loop. Beginning with the user's input prompt, the LLM then enters a loop where it can either choose to call one or multiple functions, or return a response to the user. If a function is called, the function is executed, the response is sent back to the LLM, and the Function Calling loop continues.
  • Figure 5: Radar plots highlighting how well each model tested can access particular Search Database API components.
  • ...and 2 more figures