A Sentiment Consolidation Framework for Meta-Review Generation

Miao Li, Jey Han Lau, Eduard Hovy

TL;DR

This work introduces a three-layer sentiment consolidation framework for meta-review generation, addressing whether LLMs genuinely consolidate information across opinionated sources. It defines a hierarchical process (input, consolidation, generation), develops judgement extraction and sentiment fusion analyses, and proposes two sentiment-aware evaluation metrics (FacetEval and FusionEval). The authors validate prompting strategies that encode the consolidation logic, showing improved meta-reviews over naive prompts across automatic and human evaluations. They also provide a case study and discuss limitations and future directions, highlighting the potential of explicit sentiment consolidation to enhance scientific sentiment summarization in meta-reviews.

Abstract

Modern natural language generation systems with Large Language Models (LLMs) exhibit the capability to generate a plausible summary of multiple documents; however, it is uncertain whether they truly possess the capability of information consolidation to generate summaries, especially on documents with opinionated information. We focus on meta-review generation, a form of sentiment summarization for the scientific domain. To make scientific sentiment summarization more grounded, we hypothesize that human meta-reviewers follow a three-layer framework of sentiment consolidation to write meta-reviews. Based on the framework, we propose novel prompting methods for LLMs to generate meta-reviews and evaluation metrics to assess the quality of generated meta-reviews. Our framework is validated empirically as we find that prompting LLMs based on the framework -- compared with prompting them with simple instructions -- generates better meta-reviews.

Paper Structure (33 sections, 2 equations, 5 figures, 16 tables)

Figures (5)

  • Figure 1: The three-layer framework of the underlying information consolidation logic in meta-reviewing ($P$: Positive, $P^+$: Strongly positive, $N$: Negative, $N^+$: Strongly negative).
  • Figure 2: Inter-annotator agreement on meta-reviews, official reviews and discussions in terms of Krippendorff's $\alpha$ for different judgement components including Content Expression (CE), Sentiment Expression (SE), Review Facet (RF), Sentiment Level (SL), and Convincingness Level (CL).
  • Figure 3: GPT-4's agreement with two human annotators, averaged over the two, on meta-reviews, official reviews and discussions in terms of Krippendorff's $\alpha$ for different judgement components including Content Expression (CE), Sentiment Expression (SE), Review Facet (RF), Sentiment Level (SL), and Convincingness Level (CL).
  • Figure 4: The first page of the annotation instruction for human judgement annotation.
  • Figure 5: The first page of the annotation instruction for human judgement annotation.
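
Figures 2 and 3 report agreement as Krippendorff's $\alpha$. As a rough illustration of what that statistic measures, the sketch below computes the nominal-data variant of $\alpha$ from a coincidence matrix. It is not the authors' code: the function name `krippendorff_alpha_nominal` and the toy labels are hypothetical, and the choice of the nominal distance is an assumption (ordered components such as Sentiment Level may well call for an ordinal distance instead).

```python
import math
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that
    different annotators assigned to one item. Items with fewer than two
    labels are skipped because they carry no agreement information.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by two different annotators, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (unnormalised; the shared
    # factor 1 / n is folded into the final ratio).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected == 0:
        # Fewer than two distinct labels: alpha is undefined.
        return math.nan
    return 1.0 - (n - 1) * observed / expected


# Toy data (hypothetical): two annotators label four judgements as
# positive ("P") or negative ("N"); they disagree on the third one.
toy_units = [["P", "P"], ["N", "N"], ["P", "N"], ["N", "N"]]
alpha = krippendorff_alpha_nominal(toy_units)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is how the per-component bars in the two figures can be read.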