Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries

Zhengren Wang, Dongwen Yao, Bozhou Li, Dongsheng Ma, Bo Li, Zhiyu Li, Feiyu Xiong, Bin Cui, Linpeng Tang, Wentao Zhang

TL;DR

This work defines Text2VectorSQL, the task of building a unified natural language interface for querying both structured tables and unstructured content via vector search. It introduces VectorSQLGen for synthesizing training data, VectorSQLBench for holistic multi-backend evaluation, and UniVectorSQL, a family of open-source LLMs trained on the synthetic data to translate natural language into VectorSQL. Key findings are the strong performance of the open-source models and a recall degradation phenomenon: when SQL filters interact with vector search, especially through JOIN operations, relevant results are omitted, which points to a need for co-optimizing query generation and execution. The dataset, benchmark, and models lay the groundwork for unified data interfaces that fuse SQL querying with semantic retrieval across backends.

Abstract

The proliferation of unstructured data poses a fundamental challenge to traditional database interfaces. While Text-to-SQL has democratized access to structured data, it remains incapable of interpreting semantic or multi-modal queries. Concurrently, vector search has emerged as the de facto standard for querying unstructured data, but its integration with SQL (termed VectorSQL) still relies on manual query crafting and lacks standardized evaluation methodologies, creating a significant gap between its potential and practical application. To bridge this fundamental gap, we introduce and formalize Text2VectorSQL, a novel task to establish a unified natural language interface for seamlessly querying both structured and unstructured data. To catalyze research in this new domain, we present a comprehensive foundational ecosystem, including: (1) A scalable and robust pipeline for synthesizing high-quality Text-to-VectorSQL training data. (2) VectorSQLBench, the first large-scale, multi-faceted benchmark for this task, encompassing 12 distinct combinations across three database backends (SQLite, PostgreSQL, ClickHouse) and four data sources (BIRD, Spider, arXiv, Wikipedia). (3) Several novel evaluation metrics designed for more nuanced performance analysis. Extensive experiments not only confirm strong baseline performance with our trained models, but also reveal the recall degradation challenge: the integration of SQL filters with vector search can lead to more pronounced result omissions than in conventional filtered vector search. By defining the core task, delivering the essential data and evaluation infrastructure, and identifying key research challenges, our work lays the essential groundwork to build the next generation of unified and intelligent data interfaces. Our repository is available at https://github.com/OpenDCAI/Text2VectorSQL.

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of the Text2VectorSQL task, with scenarios below showing how integrating SQL queries with vector search unlocks semantic filtering, multi-modal matching and retrieval acceleration. These capabilities are indispensable for universal natural language interfaces.
  • Figure 2: The Text2VectorSQL Ecosystem. The core component is the VectorSQLGen pipeline, a large-scale, automated data synthesis engine that produces high-quality training samples. The synthesized data is then used to train our family of UniVectorSQL models. Concurrently, a curated subset of the data undergoes a more rigorous human review process to create VectorSQLBench, our gold-standard evaluation benchmark with a suite of novel, fine-grained metrics.
  • Figure 3: Diverse syntax of vector operations in SQLite (via sqlite-vec), PostgreSQL (via pgvector) and ClickHouse. sqlite-vec generates the distance column implicitly through its virtual table mechanism and the vec0 engine. PostgreSQL and ClickHouse support building indexes for approximate nearest neighbor (ANN) search, while sqlite-vec does not so far.
  • Figure 4: Diversity of VectorSQLBench. The hierarchical structure breaks down the dataset across three dimensions: (1) nine distinct linguistic styles (inner ring), ensuring robustness to varied user phrasing; (2) three levels of vectorial complexity (middle ring), detailing how vector search is integrated with SQL (Non, WHERE, JOIN); and (3) four levels of structural SQL difficulty (outer ring), ranging from Easy to Extra Hard. This multi-faceted design provides comprehensive coverage of the challenges in Text2VectorSQL.
  • Figure 5: Recall degradation phenomenon across different models on PostgreSQL and ClickHouse. The charts show Avg. Precision, Avg. Recall, and Avg. F1 of four datasets in VectorSQLBench, categorized by hybrid integration depth. A severe drop in Avg. Recall (orange bar) is observed as the integration complexity increases from Non-integration to WHERE-Integration and is most pronounced in JOIN-Integration queries, highlighting a critical challenge in Text2VectorSQL execution.
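
The pattern in Figure 5 (precision holding while recall drops) can be illustrated with a small sketch. It is not the paper's evaluation code: the toy table, the function names, and the set-based precision/recall/F1 definitions are all assumptions made for illustration. It compares an exact reference (filter first, then take the top-k nearest rows) with one common failure mode, where the top-k vector search runs first and the SQL-style filter is applied afterwards.

```python
import math

# Hypothetical toy table of (id, category, embedding); not data from the paper.
ROWS = [
    (1, "cs", (0.0, 0.1)),
    (2, "math", (0.15, 0.0)),
    (3, "math", (0.2, 0.1)),
    (4, "cs", (0.9, 0.9)),
    (5, "cs", (1.0, 1.1)),
    (6, "math", (0.3, 0.2)),
]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(rows, query, k):
    """Brute-force k nearest rows to the query embedding."""
    return sorted(rows, key=lambda r: l2(r[2], query))[:k]


def filter_then_search(rows, query, k, category):
    """Reference result: apply the filter, then take the k nearest."""
    kept = [r for r in rows if r[1] == category]
    return [r[0] for r in top_k(kept, query, k)]


def search_then_filter(rows, query, k, category):
    """Post-filtering: take the k nearest overall, then apply the filter."""
    return [r[0] for r in top_k(rows, query, k) if r[1] == category]


def precision_recall_f1(predicted, gold):
    """Set-based metrics over returned row ids (an assumed definition)."""
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)
    p = hits / len(pred) if pred else 0.0
    r = hits / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


QUERY, K = (0.0, 0.0), 3
gold = filter_then_search(ROWS, QUERY, K, "cs")
predicted = search_then_filter(ROWS, QUERY, K, "cs")
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(gold, predicted, precision, round(recall, 3), round(f1, 3))
```

In this toy run the reference set is [1, 4, 5] while the post-filtered result is [1]: every returned row is correct, so precision is 1.0, but two of the three expected rows are missing, so recall falls to 1/3. Whether the degradation reported in the paper arises from this mechanism, from ANN index behavior, or from something else is not stated in this excerpt.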