PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing

Zhuoqun Li, Xuanang Chen, Hongyu Lin, Yaojie Lu, Xianpei Han, Shanshan Jiang, Bin Dong, Le Sun

TL;DR

PaperRegister addresses flexible-grained paper search by replacing the traditional abstract-based index with a hierarchical register index that spans multiple granularity levels. Offline, it builds a hierarchical index tree from a hierarchical register schema: LLMs extract fine-grained content from each paper, the content is aggregated bottom-up into a per-paper register, and the registers are merged into a corpus-wide index $\mathcal{I}_h$. Online, a view recognizer, trained with supervised fine-tuning followed by GRPO with a hierarchical reward, identifies the views of a query with low latency and high accuracy, and view-based matching then retrieves the relevant papers. Experiments on queries ranging from coarse to very fine-grained show state-of-the-art performance, especially on fine-grained tasks, and demonstrate compatibility with complex frameworks such as PaSa as well as practical online efficiency. The approach offers a scalable solution for precise, multi-granularity paper retrieval in real-world settings.
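
The offline and online stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the view paths, the function names (`build_register`, `build_index`, `search`), and the word-overlap scorer are all assumptions made for illustration. In particular, plain concatenation stands in for the LLM-based bottom-up aggregation, and the query view is passed in by hand rather than predicted by a view recognizer.

```python
from collections import defaultdict


def build_register(leaf_content):
    """Build a per-paper register from leaf-level content.

    leaf_content maps a view path (a tuple such as
    ("method", "module", "config")) to extracted text. Every prefix of a
    path becomes a node whose text is the concatenation of the leaves
    below it. Concatenation is a stand-in for LLM aggregation.
    """
    nodes = defaultdict(list)
    for path, text in leaf_content.items():
        for depth in range(1, len(path) + 1):
            nodes[path[:depth]].append(text)
    return {view: " ".join(texts) for view, texts in nodes.items()}


def build_index(registers):
    """Merge per-paper registers into one corpus-wide index keyed by view."""
    index = defaultdict(dict)
    for paper_id, register in registers.items():
        for view, text in register.items():
            index[view][paper_id] = text
    return dict(index)


def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def search(index, query, view, top_k=5):
    """View-based matching: rank only the content stored under one view."""
    candidates = index.get(view, {})
    ranked = sorted(
        candidates,
        key=lambda paper_id: overlap_score(query, candidates[paper_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical two-paper corpus with illustrative view names.
registers = {
    "p1": build_register({
        ("method", "module", "config"): "encoder uses 12 layers",
        ("experiment", "dataset"): "evaluated on squad",
    }),
    "p2": build_register({
        ("method", "module", "config"): "decoder uses 6 layers",
        ("experiment", "dataset"): "evaluated on nq",
    }),
}
index = build_index(registers)
top = search(index, "12 layers encoder", ("method", "module", "config"))
```

Keying the index by view means a fine-grained query is matched only against the node that holds the corresponding detail, while a coarse query can be matched against a node nearer the root.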

Abstract

As researchers delve more deeply into their work, their paper search requirements may become more flexible, sometimes involving specific details such as module configuration rather than only coarse-grained topics. Previous paper search systems cannot meet these flexible-grained requirements because they mainly build the corpus index from paper abstracts, which lack the detailed information needed to answer finer-grained queries. In this work, we propose PaperRegister, which transforms the traditional abstract-based index into a hierarchical index tree, thereby supporting queries at flexible granularity. Experiments on paper search tasks across a range of granularities demonstrate that PaperRegister achieves SOTA performance and particularly excels in fine-grained scenarios, highlighting its potential as an effective solution for flexible-grained paper search in real-world applications. Code: https://github.com/Li-Z-Q/PaperRegister.

Paper Structure

This paper contains 36 sections, 12 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: PaperRegister supports flexible-grained paper search via a hierarchical register, whereas the traditional method fails because the abstract cannot contain the required details.
  • Figure 2: PaperRegister consists of hierarchical indexing and adaptive retrieval. Offline, PaperRegister constructs a hierarchical index tree via fine-grained content extraction and bottom-up content aggregation based on a hierarchical register schema. Online, PaperRegister first identifies the views of the query and then conducts view-based matching.
  • Figure 3: Illustration of view recognizer training, including SFT and GRPO with a hierarchical reward, which is calculated from how close the predicted view is to the golden view in the hierarchical register schema.
  • Figure 4: Performance of PaperRegister with different view recognizers. The figure shows that a strong recognizer has a clear positive impact on the overall system.
  • Figure 5: Performance of combining PaperRegister with the PaSa framework. The figure shows that PaperRegister works well with the complex modules in PaSa.
  • ...and 5 more figures
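
The hierarchical reward mentioned in the Figure 3 caption can be sketched as follows. This is one plausible reading, not the paper's formula: it assumes views are paths in the schema tree and scores a prediction by the length of the prefix it shares with the golden view. The function name `hierarchical_reward` and the normalization are assumptions.

```python
def hierarchical_reward(predicted, golden):
    """Return a reward in [0, 1] from the shared prefix of two view paths.

    predicted and golden are tuples such as ("method", "module", "config").
    An exact match scores 1.0, a different top-level branch scores 0.0,
    and partial matches fall in between. Assumed scheme, for illustration.
    """
    if not predicted or not golden:
        return 0.0
    shared = 0
    for pred_node, gold_node in zip(predicted, golden):
        if pred_node != gold_node:
            break
        shared += 1
    return shared / max(len(predicted), len(golden))


golden_view = ("method", "module", "config")
exact = hierarchical_reward(("method", "module", "config"), golden_view)
sibling = hierarchical_reward(("method", "module", "size"), golden_view)
wrong_branch = hierarchical_reward(("experiment", "dataset"), golden_view)
```

Compared with a 0/1 exact-match reward, a graded reward of this kind gives partial credit to predictions that land near the golden view in the schema.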