Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation

SeongKu Kang, Bowen Jin, Wonbin Kweon, Yu Zhang, Dongha Lee, Jiawei Han, Hwanjo Yu

TL;DR

Scientific document retrieval in specialized domains suffers from scarce labeled data and incomplete concept coverage in synthetic queries. CCQGen addresses this with a two-stage approach: (i) concept identification and enrichment, using taxonomy-driven core-topic and core-phrase extraction plus a multi-task concept extractor, and (ii) concept coverage-based query generation, which adaptively conditions each subsequent query on concepts not yet covered and applies CSR-based filtering. The framework improves training signal quality and retrieval performance, achieving significant gains over strong baselines on CSFCube and DORIS-MAE with both generalist and science-specific backbones, and it remains effective with smaller LLMs. The work demonstrates that enforcing comprehensive concept coverage and leveraging concept-aligned retrieval signals can substantially enhance scientific document retrieval in data-scarce settings.

Abstract

In specialized fields like the scientific domain, constructing large-scale human-annotated datasets poses a significant challenge due to the need for domain expertise. Recent methods have employed large language models to generate synthetic queries, which serve as proxies for actual user queries. However, they lack control over the content generated, often resulting in incomplete coverage of academic concepts in documents. We introduce the Concept Coverage-based Query set Generation (CCQGen) framework, designed to generate a set of queries with comprehensive coverage of the document's concepts. A key distinction of CCQGen is that it adaptively adjusts the generation process based on the previously generated queries. We identify concepts not sufficiently covered by previous queries and leverage them as conditions for subsequent query generation. This approach guides each new query to complement the previous ones, aiding in a thorough understanding of the document. Extensive experiments demonstrate that CCQGen significantly enhances query quality and retrieval performance.
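The adaptive loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the concept weights, the substring-based coverage check, and the names `uncovered_concepts`, `concepts_in`, `generate_query_set`, and `template_generator` are all assumptions made for illustration, and a simple template stands in for the LLM call.

```python
from typing import Callable, Dict, List, Set


def uncovered_concepts(doc_concepts: Dict[str, float],
                       covered: Set[str], k: int) -> List[str]:
    """Return up to k document concepts not yet covered, highest weight first."""
    remaining = [c for c in doc_concepts if c not in covered]
    remaining.sort(key=lambda c: (-doc_concepts[c], c))
    return remaining[:k]


def concepts_in(query: str, doc_concepts: Dict[str, float]) -> Set[str]:
    """Toy coverage check (assumption): a concept is covered if its text appears in the query."""
    text = query.lower()
    return {c for c in doc_concepts if c.lower() in text}


def generate_query_set(doc_concepts: Dict[str, float],
                       generate: Callable[[List[str]], str],
                       num_queries: int = 3, k: int = 2) -> List[str]:
    """Generate queries one at a time, conditioning each on still-uncovered concepts."""
    covered: Set[str] = set()
    queries: List[str] = []
    for _ in range(num_queries):
        conditions = uncovered_concepts(doc_concepts, covered, k)
        if not conditions:  # every concept is already covered
            break
        query = generate(conditions)
        queries.append(query)
        covered |= concepts_in(query, doc_concepts)
    return queries


def template_generator(conditions: List[str]) -> str:
    """Hypothetical stand-in for an LLM: builds a query from the given concepts."""
    return "papers on " + " and ".join(conditions)


# Hypothetical concept weights for one document.
doc_concepts = {
    "query generation": 0.9,
    "dense retrieval": 0.8,
    "concept coverage": 0.7,
    "taxonomy": 0.4,
}
queries = generate_query_set(doc_concepts, template_generator, num_queries=3, k=2)
for q in queries:
    print(q)
```

In this toy setup the loop stops early once every concept has appeared in some query, so each new query complements the earlier ones rather than repeating them.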

Paper Structure

This paper contains 23 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: A conceptual comparison of (a) the existing approach for query set generation and (b) our concept coverage-based query set generation. Best viewed in color.
  • Figure 2: An overview of the Concept Coverage-based Query set Generation (CCQGen) framework. Best viewed in color.
  • Figure 3: Results with varying amounts of training data. x% denotes setups using a random x% of generated queries.
  • Figure 4: Improvements by concept coverage-based filtering.