From Text to CQL: Bridging Natural Language and Corpus Search Engine

Luming Lu, Jiyuan An, Yujie Wang, Liner Yang, Cunliang Kong, Zhenghao Liu, Shuo Wang, Haozhe Lin, Mingwei Fang, Yaping Huang, Erhong Yang

TL;DR

This work defines the Text-to-CQL task: translating natural language into Corpus Query Language (CQL) for richly annotated corpora. It introduces TCQL, a large-scale NL–CQL dataset derived from EnWiki and TCFL via collocation-driven templates, evaluates both in-context LLM prompting and fine-tuned PLMs, and proposes a novel CQLBLEU metric that blends syntactic and semantic similarity. The study shows that LLMs alone struggle to generate correct CQL, but well-designed prompts and especially fine-tuned models yield stronger performance, highlighting a path toward broader access to linguistic corpora through automated query generation. The findings have practical implications for linguistic research and corpus querying, facilitating more efficient, scalable, and accurate exploration of annotated text data across languages.

Abstract

Natural Language Processing (NLP) technologies have revolutionized the way we interact with information systems, with a significant focus on converting natural language queries into formal query languages such as SQL. However, less emphasis has been placed on the Corpus Query Language (CQL), a critical tool for linguistic research and detailed analysis within text corpora. The manual construction of CQL queries is a complex and time-intensive task that requires a great deal of expertise, which presents a notable challenge for both researchers and practitioners. This paper presents the first text-to-CQL task, which aims to automate the translation of natural language into CQL. We present a comprehensive framework for this task, including a specifically curated large-scale dataset and methods that leverage large language models (LLMs). In addition, we establish evaluation metrics that assess the syntactic and semantic accuracy of the generated queries. We develop LLM-based conversion approaches and conduct detailed experiments. The results demonstrate the efficacy of our methods and provide insights into the complexities of the text-to-CQL task.

Paper Structure (36 sections, 4 equations, 1 figure, 13 tables, 1 algorithm)

This paper contains 36 sections, 4 equations, 1 figure, 13 tables, 1 algorithm.

Figures (1)

  • Figure 1: Example task diagram. Given any input natural language query description, the model is expected to convert it into the corresponding Corpus Query Language (CQL), and the generated CQL should be accurately executable by the Corpus Engine. CQL uses symbols (green) and a small number of keywords (purple) to construct queries, and allows names (blue) to be assigned to tokens to constrain relationships between tokens. Execution Results are the results returned by BlackLab when executing the CQL on the En-Wiki corpus.
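
To make the caption's description of CQL concrete, here is a minimal Python sketch that assembles a BlackLab-style query from token constraints, with names attached to two tokens and a constraint relating them. The helper functions `token` and `named`, the attribute names `pos` and `word`, and the tag `JJ` are illustrative assumptions; they are not taken from the paper, and the annotations available depend on how a given corpus was indexed.

```python
def token(**attrs):
    """Build one CQL token constraint, e.g. [pos="JJ"]; no attributes gives the wildcard []."""
    inner = " & ".join(f'{key}="{value}"' for key, value in attrs.items())
    return f"[{inner}]"


def named(name, tok):
    """Attach a name to a token so a later constraint can refer to it."""
    return f"{name}:{tok}"


# Hypothetical request: "find an adjective, the word 'and', then the same adjective again".
sequence = " ".join([
    named("A", token(pos="JJ")),   # first adjective, named A
    token(word="and"),             # the literal word "and"
    named("B", token(pos="JJ")),   # second adjective, named B
])

# The part after '::' constrains the relationship between the two named tokens.
query = sequence + " :: A.word = B.word"
print(query)
```

The brackets, quotes, colons, and `::` are the symbol part of the syntax, while `A` and `B` are the user-chosen token names. Whether a query like this executes depends on the annotation layers of the target corpus.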