Scientific Opinion Summarization: Paper Meta-review Generation Dataset, Methods, and Evaluation

Qi Zeng; Mankeerat Sidhu; Ansel Blume; Hou Pong Chan; Lu Wang; Heng Ji

TL;DR

This work defines Scientific Opinion Summarization and introduces ORSUM, a large OpenReview-based dataset of paper meta-reviews and reviews across 47 venues. It proposes Checklist-guided Iterative Introspection (CGI$^2$), a multi-stage, evidence-grounded prompting framework for generating meta-reviews that capture consensus, controversy, and decision alignment. The authors establish an evaluation framework combining standard and LLM-based metrics and demonstrate that CGI$^2$ improves discussion involvement and reduces hallucinations relative to baselines, while highlighting the difficulty of producing comprehensive meta-reviews. The study provides a scalable approach to complex opinion synthesis in scientific discourse, with implications for automated meta-review generation and related tasks.

Abstract

Opinions in scientific research papers can be divergent, leading to controversies among reviewers. However, most existing datasets for opinion summarization are centered around product reviews and assume that the analyzed opinions are non-controversial, failing to account for the variability seen in other contexts such as academic papers, political debates, or social media discussions. To address this gap, we propose the task of scientific opinion summarization, where research paper reviews are synthesized into meta-reviews. To facilitate this task, we introduce the ORSUM dataset covering 15,062 paper meta-reviews and 57,536 paper reviews from 47 conferences. Furthermore, we propose the Checklist-guided Iterative Introspection approach, which breaks down scientific opinion summarization into several stages, iteratively refining the summary under the guidance of questions from a checklist. Our experiments show that (1) human-written summaries do not always satisfy all necessary criteria for this domain, such as depth of discussion and identification of consensus and controversy, and (2) the combination of task decomposition and iterative self-refinement shows strong potential for improving the generated opinion summaries and can be applied to other complex text generation tasks using black-box LLMs.

Paper Structure (19 sections, 5 figures, 4 tables)

Introduction
Related Work
Opinion Summarization
Meta-review Generation
Task Formulation
ORSUM Dataset
Dataset Collection and Preprocessing
Dataset Comparison
Composition Analysis
Checklist-guided Iterative Introspection Method for Meta-review Generation
Evaluation
Standard Metrics
LLM-based Metrics
Experiments
Baselines
...and 4 more sections

Figures (5)

Figure 1: Product meta-reviews and paper meta-reviews have different compositions: A product meta-review presents the most prominent opinion instead of summarizing opinions, while a paper meta-review summarizes different opinions and makes recommendations.
Figure 2: Meta-review composition. The scores range from 0 to 2: 0 indicates that the meta-review does not address the discussion at all. 1 signifies that the meta-review incorporates the discussion but lacks concrete evidence. 2 denotes that the meta-review involves a detailed discussion. Only 47.7% and 35.0% of meta-reviews meet the fundamental criteria for discussions of advantages and disadvantages, and consensus and controversy, respectively.
Figure 3: Our proposed CGI$^2$ framework operates through multiple iterations. In the initial iteration, the task is divided into four steps: (1) Review Opinion Extraction, (2) Strength and Weakness Synthesis, (3) Consensus and Controversy Analysis, and (4) Meta-review Drafting. For subsequent iterations, we present the black-box LLM with a query from a predefined list, acquire self-feedback, and request additional refinements.
Figure 4: We customize the prompts in G-EVAL and GPTLikert for evaluating meta-review generation to assess discussion involvement, opinion faithfulness, and decision consistency.
Figure 5: We show the meta-reviews from human, vanilla, CGI$^2$, and CGI$^2$ without iterative runs for the same paper. The yellow background indicates hallucinated content. The green background indicates redundant content.
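
The control flow described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cgi2`, the prompt wording, the way earlier step outputs are passed to later steps, and the `llm` callable (any function that maps a prompt string to a response string) are all assumptions made for the example. The actual checklist questions are not given in this summary, so the checklist is left as a caller-supplied argument.

```python
from typing import Callable, List

# Hypothetical interface: any black-box LLM wrapped as prompt -> response.
LLM = Callable[[str], str]

# The four step names come from the Figure 3 caption; the instruction
# texts are illustrative placeholders, not the paper's prompts.
INITIAL_STEPS = [
    ("Review Opinion Extraction",
     "Extract the opinions expressed in each review."),
    ("Strength and Weakness Synthesis",
     "Synthesize the strengths and weaknesses raised across the reviews."),
    ("Consensus and Controversy Analysis",
     "Identify where the reviewers agree and where they disagree."),
    ("Meta-review Drafting",
     "Write a meta-review based on the reviews and the notes so far."),
]


def cgi2(reviews: List[str], checklist: List[str], llm: LLM) -> str:
    """Sketch of checklist-guided iterative refinement over a draft."""
    context = "\n\n".join(
        f"Review {i + 1}:\n{text}" for i, text in enumerate(reviews)
    )

    # Initial iteration: run the four steps in order. Each step sees the
    # reviews plus the outputs of earlier steps (an assumed data flow).
    notes: List[str] = []
    draft = ""
    for name, instruction in INITIAL_STEPS:
        prompt = "\n\n".join([instruction, context] + notes)
        draft = llm(prompt)
        notes.append(f"{name}:\n{draft}")

    # Subsequent iterations: one checklist question at a time. First ask
    # for self-feedback, then request a refinement using that feedback.
    for question in checklist:
        feedback = llm(
            f"Meta-review:\n{draft}\n\n"
            f"Checklist question: {question}\n"
            "Give feedback on the meta-review with respect to this question."
        )
        draft = llm(
            f"{context}\n\nMeta-review:\n{draft}\n\n"
            f"Feedback:\n{feedback}\n\n"
            "Revise the meta-review to address the feedback."
        )
    return draft
```

Under these assumptions the sketch makes four model calls for the initial iteration and two per checklist question afterwards. Passing the model in as a plain callable keeps the loop independent of any particular LLM provider, which matches the black-box setting the caption describes.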
