SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models

Shicheng Liu; Jialiang Xu; Wesley Tjangnaka; Sina J. Semnani; Chen Jie Yu; Monica S. Lam

TL;DR

This work introduces SUQL, a formal language that extends SQL with free-text primitives so that queries over hybrid data (structured fields plus unstructured text) can be expressive, exact, and compositional. A few-shot LLM-based semantic parser translates natural-language user utterances into SUQL, which is then executed by an optimizing SUQL compiler that uses dense retrieval, enumerated-type handling, and query-order optimizations to scale to large databases. On HybridQA, the in-context learning approach comes within 8.9% exact match and 7.1% F1 of the state of the art, which was trained on 62K samples. On a real-world Yelp-based restaurant dataset, the SUQL-based conversational agent finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a linearization baseline. The results indicate that SUQL is practical for real-world hybrid information retrieval, and they point to future work on domain-specific applications, ethical considerations, and system refinements that improve the reliability of LLM-driven querying.

Abstract

While most conversational agents are grounded on either free-text or structured knowledge, many knowledge corpora consist of hybrid sources. This paper presents the first conversational agent that supports the full generality of hybrid data access for large knowledge corpora, through a language we developed called SUQL (Structured and Unstructured Query Language). Specifically, SUQL extends SQL with free-text primitives (summary and answer), so information retrieval can be composed with structured data accesses arbitrarily in a formal, succinct, precise, and interpretable notation. With SUQL, we propose the first semantic parser, an LLM with in-context learning, that can handle hybrid data sources. Our in-context learning-based approach, when applied to the HybridQA dataset, comes within 8.9% exact match and 7.1% F1 of the SOTA, which was trained on 62K data samples. More significantly, unlike previous approaches, our technique is applicable to large databases and free-text corpora. We introduce a dataset consisting of crowdsourced questions and conversations on Yelp, a large, real restaurant knowledge base with structured and unstructured data. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a baseline based on linearization.
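
To make the idea of composing a free-text primitive with structured filters concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It is not the SUQL compiler: the `restaurants` schema, the sample rows, and the keyword-matching `answer` function are all assumptions made for illustration, and in the paper the `answer` primitive is backed by an LLM rather than a string check.

```python
import sqlite3


def answer(text: str, question: str) -> str:
    """Hypothetical stand-in for SUQL's answer() primitive.

    Toy heuristic (an assumption, not the paper's method): reply "Yes"
    when the last word of the question appears in the free text.
    """
    key = question.rstrip("?").split()[-1].lower()
    return "Yes" if key in text.lower() else "No"


def find_restaurants() -> list:
    conn = sqlite3.connect(":memory:")
    # Register the stub so it can be called from inside SQL.
    conn.create_function("answer", 2, answer)

    # Illustrative schema: structured columns plus a free-text column.
    conn.execute(
        "CREATE TABLE restaurants (name TEXT, cuisine TEXT, rating REAL, reviews TEXT)"
    )
    conn.executemany(
        "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
        [
            ("Cafe Uno", "coffee", 4.5, "Great family-friendly spot with a kids menu."),
            ("Bean Bar", "coffee", 4.5, "Loud late-night crowd, strong espresso."),
            ("Pasta Hut", "italian", 4.0, "Very family-friendly and cheap."),
        ],
    )

    # Structured filters (cuisine, rating) composed with a free-text predicate.
    rows = conn.execute(
        """
        SELECT name FROM restaurants
        WHERE cuisine = 'coffee'
          AND rating >= 4.5
          AND answer(reviews, 'is this restaurant family-friendly?') = 'Yes'
        """
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(find_restaurants())
```

The point of the sketch is only the shape of the query: the free-text predicate sits in the same `WHERE` clause as ordinary column filters, so it composes with them like any other SQL expression.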

Paper Structure (21 sections, 3 equations, 5 figures, 13 tables)

Introduction
Related Work
Design and Rationale of SUQL
Design Rationale
Design of SUQL
Conversational Agent
An Optimizing SUQL Compiler
Search and Filter Optimization
Enumerated Types
Query Order Optimizations
Experiments
HybridQA Experiment
Conversational Agent on Restaurants
Collecting User Queries
Turn Accuracy
...and 6 more sections

Figures (5)

Figure 1: Comparison of the traditional approach (linearization) with our approach (semantic parsing with SUQL). Top: In the linearization approach, database entries are linearized and converted to embedding vectors. At run time, a user request is converted to an embedding vector, which is used to find the closest of the stored vectors. The results are then supplied to an LLM for response generation. Bottom: In our approach (semantic parsing with SUQL), a user utterance is parsed into formal SUQL by a few-shot-prompted LLM, and the resulting query is executed by the SUQL compiler to fetch results from the database. The results are then supplied to an LLM for response generation.
Figure 2: The restaurants table, which contains both structured and unstructured data.
Figure 3: The crowdsourcing interface that our users see.
Figure 4: The prompts we give crowdsourcing workers before they start conversing with our chatbot.
Figure 5: The questions crowdsourcing workers are asked after they finish talking to the chatbot.

Theorems & Definitions (2)

Definition 5.1
Definition 5.2
