Scientific Opinion Summarization: Paper Meta-review Generation Dataset, Methods, and Evaluation
Qi Zeng, Mankeerat Sidhu, Ansel Blume, Hou Pong Chan, Lu Wang, Heng Ji
TL;DR
This work defines Scientific Opinion Summarization and introduces ORSUM, a large OpenReview-based dataset of paper meta-reviews and reviews across 47 venues. It proposes Checklist-guided Iterative Introspection (CGI^2), a multi-stage, evidence-grounded prompting framework to generate meta-reviews that capture consensus, controversy, and decision alignment. The authors establish an evaluation framework combining standard and LLM-based metrics and demonstrate that CGI^2 improves discussion involvement and reduces hallucinations relative to baselines, while highlighting challenges in achieving comprehensive meta-reviews. The study provides a scalable approach for complex opinion synthesis in scientific discourse with implications for automated meta-review generation and related tasks.
Abstract
Opinions on scientific research papers can diverge, leading to controversies among reviewers. However, most existing opinion summarization datasets center on product reviews and assume that the analyzed opinions are non-controversial, failing to account for the variability seen in other contexts such as academic papers, political debates, or social media discussions. To address this gap, we propose the task of scientific opinion summarization, in which research paper reviews are synthesized into meta-reviews. To facilitate this task, we introduce the ORSUM dataset, covering 15,062 paper meta-reviews and 57,536 paper reviews from 47 conferences. Furthermore, we propose the Checklist-guided Iterative Introspection approach, which breaks scientific opinion summarization into several stages and iteratively refines the summary under the guidance of questions from a checklist. Our experiments show that (1) human-written summaries do not always satisfy all necessary criteria, such as depth of discussion and identification of consensus and controversy for the specific domain, and (2) the combination of task decomposition and iterative self-refinement shows strong potential for enhancing the opinion summaries and can be applied to other complex text generation tasks using black-box LLMs.
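The staged, checklist-driven refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the function names (checklist_guided_refine, echo_llm), and the single feedback-then-revise pass per question are all assumptions made for illustration, and the model is abstracted as a hypothetical callable that maps a prompt string to a response string.

```python
from typing import Callable, List

# Hypothetical black-box interface: prompt in, generated text out.
LLM = Callable[[str], str]


def checklist_guided_refine(reviews: List[str], checklist: List[str], llm: LLM) -> str:
    """Draft a meta-review, then revise it once per checklist question."""
    joined = "\n\n".join(f"Review {i + 1}:\n{r}" for i, r in enumerate(reviews))

    # Stage 1: initial draft from the raw reviews (prompt text is illustrative).
    draft = llm(f"Write a meta-review summarizing these reviews.\n\n{joined}")

    # Stage 2: iterative introspection, one checklist question at a time.
    for question in checklist:
        # Ask the model to critique its own draft against the question.
        feedback = llm(
            f"Reviews:\n{joined}\n\nDraft meta-review:\n{draft}\n\n"
            f"Checklist question: {question}\n"
            "Answer the question and point out what the draft is missing."
        )
        # Ask the model to revise the draft using that critique.
        draft = llm(
            f"Reviews:\n{joined}\n\nDraft meta-review:\n{draft}\n\n"
            f"Feedback:\n{feedback}\n\nRevise the draft to address the feedback."
        )
    return draft


def echo_llm(prompt: str) -> str:
    """Stand-in model for dry runs: returns a tag naming the prompt type."""
    if "Revise the draft" in prompt:
        return "revised draft"
    if "Checklist question" in prompt:
        return "feedback"
    return "initial draft"
```

To try it against a real model, replace echo_llm with any function that sends the prompt to a black-box LLM and returns its text. Checklist questions might cover the criteria the abstract names, for example whether the summary states reviewer consensus, whether it surfaces controversy, and whether the discussion has sufficient depth. Each question costs two model calls in this sketch, so a checklist of n questions makes 2n + 1 calls in total.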
