PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing
Zhuoqun Li, Xuanang Chen, Hongyu Lin, Yaojie Lu, Xianpei Han, Shanshan Jiang, Bin Dong, Le Sun
TL;DR
PaperRegister tackles flexible-grained paper search by replacing the traditional abstract-based index with a hierarchical register index that spans multiple granularity levels. Offline, it builds a hierarchical index tree from a hierarchical register schema: LLMs extract fine-grained content, which is aggregated bottom-up into a register for each paper, and the per-paper registers are then merged into a corpus-wide index $\mathcal{I}_h$. Online, a view recognizer, trained with supervised fine-tuning and hierarchical-reward GRPO, identifies the views a query targets with low latency and high accuracy, and view-based matching then retrieves the relevant papers. Experiments on queries ranging from coarse to very fine-grained show state-of-the-art performance, especially on fine-grained tasks, and demonstrate compatibility with complex frameworks such as PaSa as well as practical online efficiency. The approach offers a scalable solution for precise, multi-granularity paper retrieval in real-world settings.
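The offline and online stages described above can be illustrated with a small Python sketch. It is not the authors' implementation: the schema, the view names, the toy leaf texts, and the helpers `build_register`, `merge_index`, and `search` are all hypothetical. Leaf extraction is a dictionary lookup standing in for the LLM, aggregation is plain concatenation, matching is token overlap standing in for a real retriever, and the query view is passed in by hand instead of being predicted by a view recognizer.

```python
from collections import defaultdict

# Hypothetical register schema: a nested dict in which an empty dict marks a
# leaf (finest-grained) view. The real schema is not given in this summary.
SCHEMA = {
    "paper": {
        "method": {"module_configuration": {}, "training": {}},
        "experiment": {"dataset": {}, "metric": {}},
    }
}

def build_register(schema, extract_leaf, path=()):
    """Build one paper's register as {view_path: text}.

    Leaves are filled by extract_leaf (a stand-in for LLM extraction);
    each internal view is the concatenation of its children (bottom-up).
    """
    register = {}
    for name, children in schema.items():
        view = path + (name,)
        if children:
            # Fill all descendants first, then aggregate the direct children.
            register.update(build_register(children, extract_leaf, view))
            text = " ".join(register[view + (child,)] for child in children)
        else:
            text = extract_leaf(view)
        register[view] = text
    return register

def merge_index(registers):
    """Merge per-paper registers into a corpus-wide index keyed by view."""
    index = defaultdict(dict)
    for paper_id, register in registers.items():
        for view, text in register.items():
            index[view][paper_id] = text
    return dict(index)

def search(index, query, view, top_k=3):
    """Rank papers by token overlap with the query inside a single view."""
    query_tokens = set(query.lower().split())
    scored = []
    for paper_id, text in index.get(view, {}).items():
        overlap = len(query_tokens & set(text.lower().split()))
        scored.append((overlap, paper_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [paper_id for overlap, paper_id in scored[:top_k] if overlap > 0]

# Toy leaf contents standing in for what an LLM would extract from full text.
TOY_LEAVES = {
    "p1": {
        ("paper", "method", "module_configuration"): "encoder with 12 attention heads",
        ("paper", "method", "training"): "fine-tuned with adamw",
        ("paper", "experiment", "dataset"): "evaluated on squad",
        ("paper", "experiment", "metric"): "exact match",
    },
    "p2": {
        ("paper", "method", "module_configuration"): "decoder with rotary position embeddings",
        ("paper", "method", "training"): "pretrained with sgd",
        ("paper", "experiment", "dataset"): "evaluated on imagenet",
        ("paper", "experiment", "metric"): "top-1 accuracy",
    },
}

registers = {
    paper_id: build_register(SCHEMA, leaves.__getitem__)
    for paper_id, leaves in TOY_LEAVES.items()
}
index = merge_index(registers)

# A fine-grained query is matched against a fine-grained view,
# a coarse query against the root view.
fine_hits = search(index, "12 attention heads",
                   ("paper", "method", "module_configuration"))
coarse_hits = search(index, "squad", ("paper",))
```

In this sketch a query about attention heads is matched only against the `module_configuration` view, so details that an abstract would omit remain retrievable, while a coarse query can still be served from the aggregated root view.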
Abstract
As researchers delve more deeply into their work, their paper search requirements may become more flexible, sometimes involving specific details such as module configuration rather than being limited to coarse-grained topics. However, previous paper search systems cannot meet these flexible-grained requirements, because they mainly collect paper abstracts to construct the corpus index, which lacks the detailed information needed to support finer-grained queries. In this work, we propose PaperRegister, which transforms the traditional abstract-based index into a hierarchical index tree, thereby supporting queries at flexible granularity. Experiments on paper search tasks across a range of granularities demonstrate that PaperRegister achieves state-of-the-art (SOTA) performance and particularly excels in fine-grained scenarios, highlighting its potential as an effective solution for flexible-grained paper search in real-world applications. Repository: https://github.com/Li-Z-Q/PaperRegister
