Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation

Jinwei Lu, Yuanfeng Song, Zhiqian Qin, Haodi Zhang, Chen Zhang, Raymond Chi-Wing Wong

TL;DR

This work introduces the Text-to-NoSQL task to translate natural language queries into NoSQL queries, addressing a key usability gap for NoSQL systems. It presents TEND, a large-scale, open-source benchmark, and SMART, a four-stage SLM+RAG framework that predicts schemas, generates and refines NoSQL queries, and optimizes them using execution feedback. The authors propose a semi-automatic data-construction pipeline, a suite of novel evaluation metrics, and extensive experiments showing SMART significantly outperforms strong baselines while ablations confirm each component’s value. The work lays a foundation for accessible NoSQL querying, offering a scalable pathway to real-world NL interfaces for diverse NoSQL platforms and datasets.

Abstract

NoSQL databases have become increasingly popular due to their outstanding performance in handling large-scale, unstructured, and semi-structured data, highlighting the need for user-friendly interfaces to bridge the gap between non-technical users and complex database queries. In this paper, we introduce the Text-to-NoSQL task, which aims to convert natural language queries into NoSQL queries, thereby lowering the technical barrier for non-expert users. To promote research in this area, we developed a novel automated dataset construction process and released a large-scale, open-source dataset for this task, named TEND (short for Text-to-NoSQL Dataset). Additionally, we designed an SLM (Small Language Model)-assisted and RAG (Retrieval-Augmented Generation)-assisted multi-step framework called SMART, tailored to Text-to-NoSQL conversion. To ensure comprehensive evaluation, we also introduced a detailed set of metrics that assess a model's performance in terms of both the generated query itself and its execution results. Our experimental results demonstrate the effectiveness of our approach and establish a benchmark for future research in this emerging field. We believe that our contributions will pave the way for more accessible and intuitive interactions with NoSQL databases.
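
To make the task and the two evaluation views concrete, here is a minimal sketch, not the paper's implementation. It pairs a natural language question with a MongoDB-style filter and compares a predicted filter against a reference one in two ways: on the query itself and on its execution results. The collection `STUDENTS`, the toy evaluator `matches`, and the names `query_exact_match` and `execution_match` are all hypothetical illustrations; the actual metrics are defined in the paper and are not reproduced here.

```python
import json

# Hypothetical toy collection standing in for a MongoDB collection.
STUDENTS = [
    {"name": "Ann", "age": 21, "major": "CS"},
    {"name": "Bob", "age": 19, "major": "Math"},
    {"name": "Cy", "age": 25, "major": "CS"},
]

# Comparison operators supported by this toy evaluator
# (a small subset of MongoDB's query operators).
OPS = {
    "$gt": lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt": lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
}


def matches(doc, flt):
    """Return True if doc satisfies every condition in the filter."""
    for field, cond in flt.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            # Operator form, e.g. {"age": {"$gt": 20}}.
            if value is None or not all(OPS[op](value, arg) for op, arg in cond.items()):
                return False
        elif value != cond:
            # Plain equality form, e.g. {"major": "CS"}.
            return False
    return True


def execute(flt, field):
    """Run the filter over the toy collection and project a single field."""
    return [doc[field] for doc in STUDENTS if matches(doc, flt)]


def query_exact_match(pred, gold):
    """Query-level view (hypothetical): compare canonical JSON forms."""
    return json.dumps(pred, sort_keys=True) == json.dumps(gold, sort_keys=True)


def execution_match(pred, gold, field):
    """Execution-level view (hypothetical): compare returned values, ignoring order."""
    return sorted(execute(pred, field)) == sorted(execute(gold, field))


question = "Which CS students are older than 20?"
gold = {"major": "CS", "age": {"$gt": 20}}    # reference filter
pred = {"age": {"$gte": 21}, "major": "CS"}   # a model's prediction

print(execute(gold, "name"))
print(query_exact_match(pred, gold))
print(execution_match(pred, gold, "name"))
```

Here the predicted filter differs textually from the reference yet returns the same documents, which is why looking at both the query and its execution results is informative. Note that agreement on a single database instance does not prove two queries are equivalent in general.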

Paper Structure

This paper contains 34 sections, 4 figures, 12 tables, 2 algorithms.

Figures (4)

  • Figure 1: An example of Text-to-NoSQL. A Text-to-NoSQL model converts a user's natural language query into a NoSQL query, which is then executed in the corresponding NoSQL database to obtain the desired result.
  • Figure 2: Construction pipeline for the TEND dataset, which automatically converts a Text-to-SQL dataset into a Text-to-NoSQL dataset. Specifically, (i) SQL databases are converted to NoSQL (e.g., MongoDB) databases programmatically by dedicated software; (ii) examples are fed to the most advanced LLM (such as GPT-4) to obtain Chain-of-Thought examples, termed Advanced CoT, which assist a second-tier LLM (such as GPT-3.5) in reasoning, thereby automating NoSQL query generation, feedback generation, and query debugging; (iii) the dataset's questions are expanded using Multi-LLM.
  • Figure 3: The working pipeline of our proposed SMART framework, where (i) SLM-based Schema Prediction predicts the required NoSQL schemas by fine-tuning the SLM. (ii) SLM-based Query Generation generates initial NoSQL queries by fine-tuning the SLM. (iii) Predicted Schema-driven and Retrieved Example-driven Query Refinement is responsible for refining the NoSQL queries generated by the SLM. The RAG technique used here is detailed in Section \ref{sec:query_refine}. (iv) Execution Result-based Query Optimization optimizes the refined NoSQL queries based on their execution results.
  • Figure 4: Parameter study. The curves show how SMART's performance varies with the number of retrieved examples under different metrics. The vertical axis represents the model's accuracy under a specific metric, and the horizontal axis represents the number of retrieved examples.
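
The four stages named in the Figure 3 caption can be sketched as a simple data flow. This is an illustrative outline under stated assumptions, not the authors' code: every function name and signature here is hypothetical, the fine-tuned SLMs and the refinement and optimization models are passed in as plain callables, the inputs of each stage are guessed from the caption alone, and the word-overlap retriever is a stand-in for whatever retrieval the RAG component actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """A stored (question, NoSQL query) pair available for retrieval."""
    question: str
    query: str


def retrieve_examples(question: str, pool: List[Example], k: int) -> List[Example]:
    """Toy retriever (assumption): rank stored examples by word overlap."""
    q_words = set(question.lower().split())
    ranked = sorted(
        pool,
        key=lambda ex: len(q_words & set(ex.question.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def smart_pipeline(
    question: str,
    predict_schema: Callable,   # stage (i): stands in for a fine-tuned SLM
    generate_query: Callable,   # stage (ii): stands in for a fine-tuned SLM
    refine_query: Callable,     # stage (iii): schema- and example-driven refinement
    execute: Callable,          # runs a query against the database
    optimize_query: Callable,   # stage (iv): uses the execution result
    pool: List[Example],
    k: int = 2,
) -> str:
    schema = predict_schema(question)                          # (i)
    draft = generate_query(question)                           # (ii)
    examples = retrieve_examples(question, pool, k)            # retrieval for (iii)
    refined = refine_query(question, draft, schema, examples)  # (iii)
    result = execute(refined)                                  # feedback for (iv)
    return optimize_query(question, refined, result)           # (iv)


# Usage with trivial stand-in callables (all hypothetical).
pool = [
    Example("List students older than 20", "db.students.find({age: {$gt: 20}})"),
    Example("Count all courses", "db.courses.countDocuments({})"),
    Example("List students in CS", "db.students.find({major: 'CS'})"),
]

final_query = smart_pipeline(
    "Which students are older than 25",
    predict_schema=lambda q: ["students.age", "students.name"],
    generate_query=lambda q: "db.students.find({age: {$gt: 25}})",
    refine_query=lambda q, draft, schema, exs: draft.replace("})", "}, {name: 1})"),
    execute=lambda query: [],
    optimize_query=lambda q, query, result: query,
    pool=pool,
)
print(final_query)
```

Passing each stage as a callable keeps the sketch independent of any particular model or database client. The parameter `k` corresponds to the number of retrieved examples, the quantity varied in the parameter study of Figure 4.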