Tursio Database Search: How far are we from ChatGPT?

Sulbha Jain, Shivani Tripathi, Shi Qiao, Alekh Jindal

Abstract

Business users need to search enterprise databases using natural language, just as they now search the web using ChatGPT or Perplexity. However, existing benchmarks -- designed for open-domain QA or text-to-SQL -- do not evaluate the end-to-end quality of such a search experience. We present an evaluation framework for structured database search that generates realistic banking queries across varying difficulty levels and assesses answer quality using relevance, safety, and conversational metrics via an LLM-as-judge approach. We apply this framework to compare Tursio, a database search platform, against ChatGPT and Perplexity on a credit union banking schema. Our results show that Tursio achieves answer relevancy statistically comparable to both baselines (97.8% vs. 98.1% on simple, 90.0% vs. 100.0% on medium, 89.5% vs. 100.0% on hard questions), even though Tursio answers from a structured database while the baselines generate responses from the open web. We analyze the failure modes, identify database completeness as the primary bottleneck, and outline directions for improving both the evaluation methodology and the systems under evaluation.

Paper Structure

This paper contains 30 sections, 1 equation, 21 figures, 2 tables.

Figures (21)

  • Figure 1: Tursio search platform connects databases through a context graph, enabling natural language search for agents, applications, and non-expert users.
  • Figure 2: SQL-to-question token-length ratio, sorted by ratio, across benchmarks and production logs. BIRD questions are mostly literal (ratio $\approx$ 1). Enterprise workloads (BEAVER, Tursio) show significantly higher ratios, confirming that business questions require far more SQL than their surface form suggests.
  • Figure 3: Question generation pipeline. Golden questions, personas, KPIs, difficulty definitions, and the database schema are combined to generate synthetic questions via an LLM. Questions are quality-checked and then mapped to real-world equivalents for benchmarking.
  • Figure 4: Prompt template to map questions to real-world equivalents.
  • Figure 5: Answer evaluation pipeline. Synthetic custom questions are answered by Tursio, while their real-world equivalents are answered by ChatGPT and Perplexity. All responses are evaluated by DeepEval using the question, persona, and KPI as context.
  • ...and 16 more figures
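
The SQL-to-question token-length ratio plotted in Figure 2 can be illustrated with a minimal sketch. The tokenizer here (a simple regex split into words and punctuation) is an assumption, since the listing above does not say how tokens are counted; the example question, SQL, table names, and column names are hypothetical and not taken from the paper's schema.

```python
import re


def token_count(text: str) -> int:
    # Assumption: words and individual punctuation marks each count as one token.
    return len(re.findall(r"\w+|[^\w\s]", text))


def sql_to_question_ratio(sql: str, question: str) -> float:
    # Ratio of SQL length to question length, both measured in tokens.
    question_tokens = token_count(question)
    if question_tokens == 0:
        raise ValueError("question must contain at least one token")
    return token_count(sql) / question_tokens


# Hypothetical (question, SQL) pairs for illustration only.
pairs = [
    (
        "How many members have an auto loan?",
        "SELECT COUNT(DISTINCT m.member_id) FROM members m "
        "JOIN loans l ON l.member_id = m.member_id "
        "WHERE l.loan_type = 'auto'",
    ),
    (
        "List all branch names.",
        "SELECT name FROM branches",
    ),
]

# Sort by ratio, as in the figure's x-axis ordering.
ratios = sorted(sql_to_question_ratio(sql, q) for q, sql in pairs)
```

A ratio near 1 indicates a question that maps almost literally onto its SQL; larger ratios indicate that the query needs joins, filters, or aggregations that the question leaves implicit.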