Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation
SeongKu Kang, Bowen Jin, Wonbin Kweon, Yu Zhang, Dongha Lee, Jiawei Han, Hwanjo Yu
TL;DR
Scientific document retrieval in specialized domains suffers from scarce labeled data and incomplete concept coverage in synthetic queries. CCQGen addresses this with a two-stage approach: (i) concept identification and enrichment, using taxonomy-driven core-topic and core-phrase extraction plus a multi-task concept extractor, and (ii) concept coverage-based query generation, which adaptively conditions each subsequent query on uncovered concepts and applies CSR-based filtering. The framework improves training signal quality and retrieval performance, achieving significant gains over strong baselines on CSFCube and DORIS-MAE with both generalist and science-specific backbones, and it remains effective with smaller LLMs. This work demonstrates that enforcing comprehensive concept coverage and leveraging concept-aligned retrieval signals can substantially enhance scientific document retrieval in data-scarce settings.
Abstract
In specialized fields like the scientific domain, constructing large-scale human-annotated datasets poses a significant challenge due to the need for domain expertise. Recent methods have employed large language models to generate synthetic queries, which serve as proxies for actual user queries. However, these methods lack control over the generated content, often resulting in incomplete coverage of the academic concepts in a document. We introduce the Concept Coverage-based Query set Generation (CCQGen) framework, designed to generate a set of queries with comprehensive coverage of the document's concepts. A key distinction of CCQGen is that it adaptively adjusts the generation process based on the previously generated queries. We identify concepts not sufficiently covered by previous queries and leverage them as conditions for subsequent query generation. This approach guides each new query to complement the previous ones, aiding in a thorough understanding of the document. Extensive experiments demonstrate that CCQGen significantly enhances query quality and retrieval performance.
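The adaptive loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`uncovered_concepts`, `generate_query_set`, `toy_generator`), the prompt wording, the concept weights, and the substring-based notion of coverage are all assumptions made for illustration, and the LLM is replaced by an arbitrary callable.

```python
from typing import Callable, Dict, List


def uncovered_concepts(doc_concepts: Dict[str, float],
                       queries: List[str],
                       top_k: int = 3) -> List[str]:
    """Return up to top_k document concepts not yet covered by the queries.

    Assumption: a concept counts as covered if it appears as a substring of
    any earlier query. This is a crude stand-in for a learned coverage measure.
    """
    text = " ".join(q.lower() for q in queries)
    # Residual importance: the concept's weight if no query mentions it, else 0.
    residual = {c: (0.0 if c.lower() in text else w)
                for c, w in doc_concepts.items()}
    ranked = sorted(residual, key=lambda c: residual[c], reverse=True)
    return [c for c in ranked if residual[c] > 0][:top_k]


def generate_query_set(document: str,
                       doc_concepts: Dict[str, float],
                       generate: Callable[[str], str],
                       num_queries: int = 3,
                       top_k: int = 2) -> List[str]:
    """Generate queries one at a time, conditioning each on uncovered concepts."""
    queries: List[str] = []
    for _ in range(num_queries):
        targets = uncovered_concepts(doc_concepts, queries, top_k)
        if not targets:
            break  # every concept is already covered
        # Hypothetical prompt format; the real prompt is not given in the text.
        prompt = (f"Document: {document}\n"
                  f"Write a search query covering: {', '.join(targets)}")
        queries.append(generate(prompt))
    return queries


def toy_generator(prompt: str) -> str:
    """Stand-in for an LLM call: echoes the concepts requested in the prompt."""
    return "papers on " + prompt.rsplit("covering: ", 1)[1]


if __name__ == "__main__":
    concepts = {"graph neural networks": 0.5,
                "contrastive learning": 0.3,
                "recommendation": 0.2}
    for q in generate_query_set("(document text)", concepts, toy_generator):
        print(q)
```

In this sketch each new query is steered toward whatever the earlier queries missed, so the query set as a whole covers the document's concepts rather than repeating its most salient ones. Swapping `toy_generator` for a real LLM call and the substring check for a proper concept extractor would be the natural next steps.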
