Research Paper Recommender System by Considering Users' Information Seeking Behaviors
Zhelin Xu, Shuhei Yamamoto, Hideo Joho
TL;DR
The paper addresses information overload in scientific literature search by moving beyond global content similarity to a section-aware content-based filtering approach. It learns a paper representation from both the overall abstract content and weighted signals from the background, method, and results sections, with weights determined by a learned attention mechanism and title augmentation. The model uses SPECTER embeddings, a multi-head attention module, and a nonlinear MLP, trained with a triplet loss that emphasizes hard negatives, achieving state-of-the-art results on the DBLP dataset (MAP ≈ 0.808 and recall@5 ≈ 0.813). This section-aware representation improves relevance and ranking, offering practical benefits for novice researchers and scalable literature discovery, with future work aiming to validate weights via user studies and extend to larger datasets.
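The triplet objective mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the margin of 1.0, the rule of picking the negative closest to the anchor as the "hard" one, and the names `euclidean`, `hardest_negative`, and `triplet_loss` are all assumptions made for illustration, and plain Python lists stand in for the learned paper representations.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hardest_negative(anchor, negatives):
    """Assumed mining rule: the negative closest to the anchor is the hardest."""
    return min(negatives, key=lambda n: euclidean(anchor, n))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss; the margin value is an illustrative assumption."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy vectors standing in for paper embeddings (hypothetical data).
anchor = [0.0, 0.0]
positive = [1.0, 0.0]
negatives = [[3.0, 0.0], [0.0, 1.5]]

hard = hardest_negative(anchor, negatives)
loss_hard = triplet_loss(anchor, positive, hard)
loss_easy = triplet_loss(anchor, positive, negatives[0])
```

In this toy case the distant negative already satisfies the margin and contributes no loss, while the nearby negative still does, which is why emphasizing hard negatives tends to give a more informative training signal.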
Abstract
With the rapid growth of scientific publications, researchers need to spend more time and effort searching for papers that align with their research interests. To address this challenge, paper recommendation systems have been developed to help researchers effectively identify relevant papers. One of the leading approaches to paper recommendation is content-based filtering. Traditional content-based filtering methods recommend relevant papers to users based on the overall similarity of papers. However, these approaches do not take into account the information seeking behaviors that users commonly employ when searching for literature. Such behaviors include not only evaluating the overall similarity among papers, but also focusing on specific sections, such as the method section, to ensure that the approach aligns with the user's interests. In this paper, we propose a content-based filtering recommendation method that takes these information seeking behaviors into account. Specifically, in addition to considering the overall content of a paper, our approach also considers three specific sections (background, method, and results) and assigns weights to them to better reflect user preferences. We conduct offline evaluations on the publicly available DBLP dataset, and the results demonstrate that the proposed method outperforms six baseline methods in terms of precision, recall, F1-score, MRR, and MAP.
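The idea of combining overall similarity with weighted section-level similarity can be sketched as follows. This is a simplified illustration under stated assumptions, not the proposed model: the section weights here are fixed by hand rather than learned, cosine similarity over toy vectors replaces the learned representation, and the names `cosine`, `section_aware_score`, and `rank`, as well as the paper IDs `p1` and `p2`, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def section_aware_score(query, candidate, weights):
    """Overall similarity plus a weighted sum of per-section similarities."""
    score = cosine(query["overall"], candidate["overall"])
    for section, weight in weights.items():
        score += weight * cosine(query[section], candidate[section])
    return score

def rank(query, candidates, weights):
    """Return candidate IDs sorted from most to least relevant."""
    return sorted(
        candidates,
        key=lambda pid: section_aware_score(query, candidates[pid], weights),
        reverse=True,
    )

# Toy embeddings (hypothetical): p1 matches the query overall,
# p2 matches it less overall but shares the same method.
query = {"overall": [1, 0], "background": [1, 0], "method": [0, 1], "results": [1, 1]}
candidates = {
    "p1": {"overall": [1, 0], "background": [1, 0], "method": [1, 0], "results": [1, 1]},
    "p2": {"overall": [1, 1], "background": [1, 0], "method": [0, 1], "results": [1, 1]},
}

# Hand-picked weights that emphasize the method section (an assumption).
method_focused = {"background": 0.2, "method": 0.6, "results": 0.2}
overall_only = {"background": 0.0, "method": 0.0, "results": 0.0}

ranking_with_sections = rank(query, candidates, method_focused)
ranking_overall_only = rank(query, candidates, overall_only)
```

With the section weights set to zero, the ranking follows overall similarity alone and `p1` comes first; with the method section emphasized, `p2` moves ahead, which mirrors the behavior of a user who checks whether a paper's approach matches their interests.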
