exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem
Sajad Ebrahimi, Sara Salamat, Negar Arabzadeh, Mahdi Bashari, Ebrahim Bagheri
TL;DR
exHarmony reframes the Reviewer Assignment Problem as an information retrieval task and builds a large, OpenAlex-based benchmark without explicit reviewer labels. It defines six self-supervised gold standards drawn from authors, cited authors, and top-k similar citations, with an established-author threshold of at least 15 publications. The study benchmarks lexical, static embedding, and contextualized neural baselines, plus dense retrievers, and finds that contextualized embeddings trained on scholarly text (e.g., SPECTER, SciBERT) yield the best relevance and diversity signals. The dataset and code are released to support reproducible evaluation and to guide future work on fair and diverse reviewer assignment.
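Two of the label sources summarized above (a manuscript's own authors and the authors of the works it cites), together with the established-author filter, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a hypothetical in-memory toy corpus rather than OpenAlex records, the `Paper` class and helper names are hypothetical, publication counts are taken only from the toy corpus, and the top-k similar-citation source is omitted because it needs a similarity model. The default threshold of 15 publications is the one stated above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical minimal record: an id, its authors, and the ids it cites."""
    paper_id: str
    authors: list[str]
    cited_ids: list[str] = field(default_factory=list)


def publication_counts(corpus: dict[str, Paper]) -> Counter:
    """Count how many papers in the corpus each author appears on."""
    counts = Counter()
    for paper in corpus.values():
        counts.update(set(paper.authors))
    return counts


def author_gold(paper: Paper) -> set[str]:
    """Gold reviewers taken from the manuscript's own authors."""
    return set(paper.authors)


def cited_author_gold(paper: Paper, corpus: dict[str, Paper]) -> set[str]:
    """Gold reviewers taken from authors of cited works present in the corpus."""
    gold = set()
    for cited_id in paper.cited_ids:
        if cited_id in corpus:
            gold.update(corpus[cited_id].authors)
    return gold


def established_only(gold: set[str], counts: Counter, min_pubs: int = 15) -> set[str]:
    """Keep only authors with at least `min_pubs` publications."""
    return {a for a in gold if counts[a] >= min_pubs}


# Toy usage: the threshold is lowered to 2 because the corpus is tiny.
corpus = {
    "p1": Paper("p1", ["alice", "bob"], ["p2", "p3"]),
    "p2": Paper("p2", ["carol"]),
    "p3": Paper("p3", ["alice", "dave"]),
}
counts = publication_counts(corpus)
print(sorted(cited_author_gold(corpus["p1"], corpus)))
print(sorted(established_only(author_gold(corpus["p1"]), counts, min_pubs=2)))
```

Applying the filter to each label source yields an "established" variant of it; whether the six gold standards pair up exactly this way is an assumption of this sketch.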
Abstract
The peer review process is crucial for ensuring the quality and reliability of scholarly work, yet assigning suitable reviewers remains a significant challenge. Traditional manual methods are labor-intensive and often ineffective, leading to nonconstructive or biased reviews. This paper introduces the exHarmony (eHarmony but for connecting experts to manuscripts) benchmark, designed to address these challenges by re-imagining the Reviewer Assignment Problem (RAP) as a retrieval task. Using the extensive data in OpenAlex, we propose a novel approach that treats signals from a manuscript's authors, its most similar experts, and its citation relations as indicators of suitable reviewers. This approach lets us build a standard benchmark dataset for evaluating RAP without explicit labels. We benchmark various methods, including traditional lexical matching, static neural embeddings, and contextualized neural embeddings, and introduce evaluation metrics that assess both relevance and diversity in the context of RAP. Our results indicate that while traditional methods perform reasonably well, contextualized embeddings trained on scholarly literature perform best. The findings underscore the need for further research to improve the diversity and effectiveness of reviewer assignments.
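Framing RAP as retrieval means the manuscript acts as the query and candidate reviewers act as documents to be ranked. The sketch below illustrates only the lexical-matching end of the benchmarked methods, under stated assumptions: each reviewer is represented by a hypothetical profile text (for example, concatenated titles or abstracts of their papers), TF-IDF cosine similarity stands in for whatever lexical model the benchmark actually uses (e.g., BM25), and recall@k is a generic relevance metric, not necessarily one of the paper's metrics. Diversity metrics are not shown. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(profiles: dict[str, list[str]]) -> dict[str, float]:
    """Smoothed IDF computed over the tokenized candidate reviewer profiles."""
    n = len(profiles)
    df = Counter()
    for tokens in profiles.values():
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """TF-IDF weights; terms unseen in every profile are dropped."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_reviewers(manuscript: str, profiles: dict[str, str]) -> list[str]:
    """Rank candidate reviewers by lexical similarity to the manuscript."""
    tokens = {r: tokenize(text) for r, text in profiles.items()}
    idf = idf_table(tokens)
    query = vectorize(tokenize(manuscript), idf)
    scores = {r: cosine(query, vectorize(t, idf)) for r, t in tokens.items()}
    # Descending score; ties broken alphabetically for a deterministic order.
    return sorted(scores, key=lambda r: (-scores[r], r))


def recall_at_k(ranking: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold reviewers that appear in the top k of the ranking."""
    if not gold:
        return 0.0
    return len(set(ranking[:k]) & gold) / len(gold)


# Toy usage with hypothetical profiles and a hypothetical gold set.
profiles = {
    "alice": "dense retrieval for scholarly search with contextualized embeddings",
    "bob": "graph neural networks for molecule property prediction",
    "carol": "citation recommendation and scholarly document retrieval",
}
manuscript = "benchmarking retrieval models for scholarly reviewer assignment"
ranking = rank_reviewers(manuscript, profiles)
print(ranking)
print(recall_at_k(ranking, {"carol"}, k=2))
```

Swapping `rank_reviewers` for an embedding-based scorer while keeping the same gold sets and metric is what makes lexical, static, and contextualized methods directly comparable in a retrieval framing.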
