Do Multi-Document Summarization Models Synthesize?

Jay DeYoung; Stephanie C. Martinez; Iain J. Marshall; Byron C. Wallace

TL;DR

The paper investigates whether neural multi-document summarization models truly synthesize conflicting inputs into an accurate aggregate summary. It formalizes synthesis via a latent property $z_i$ and an aggregation function $G(Z_i)$, and evaluates models on Rotten Tomatoes meta-reviews and Cochrane systematic reviews, showing partial synthesis with notable ordering sensitivity and composition insensitivity. A generate-diverse-then-select inference-time method is proposed to align outputs with the predicted aggregate, using an external measure and the option to abstain; this approach yields measurable gains in synthesis calibration, sometimes approaching human performance on these task-specific metrics. The work highlights the practical limits of current models for evidence synthesis and provides a concrete reranking-style strategy to surface better syntheses, with implications for evidence aggregation and narrative summaries accompanying scientific literature.

Abstract

Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
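The selection step described above can be illustrated with a minimal sketch. It assumes the candidate summaries have already been generated (the diverse generation step is not shown), and it uses a toy word-list sentiment scorer as a stand-in for the external measure. The function names, the word lists, the use of the arithmetic mean as the aggregate, and the `tolerance` threshold are all hypothetical choices made for illustration, not the authors' implementation.

```python
from statistics import mean
from typing import Callable, Optional, Sequence

# Hypothetical toy lexicon; a real system would use a trained measure.
POSITIVE = {"great", "charming", "strong"}
NEGATIVE = {"dull", "weak", "tedious"}


def toy_sentiment(text: str) -> float:
    """Score text in [0, 1] as the share of lexicon hits that are positive."""
    words = [w.strip(".,") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)


def select_or_abstain(
    candidates: Sequence[str],
    input_scores: Sequence[float],
    measure: Callable[[str], float],
    aggregate: Callable[[Sequence[float]], float] = mean,
    tolerance: float = 0.1,  # assumed threshold, not taken from the paper
) -> Optional[str]:
    """Return the candidate closest to the aggregate of the inputs, or None."""
    target = aggregate(input_scores)
    best, best_gap = None, float("inf")
    for candidate in candidates:
        gap = abs(measure(candidate) - target)
        if gap < best_gap:
            best, best_gap = candidate, gap
    # Abstain when even the best candidate is too far from the target.
    return best if best_gap <= tolerance else None


CANDIDATES = [
    "A great, charming film.",
    "Charming but dull and tedious.",
    "A dull, weak, tedious film.",
    "Great cast, weak script.",
]

# Mixed reviews: the balanced candidate is within tolerance and is selected.
print(select_or_abstain(CANDIDATES, [0.9, 0.2, 0.4, 0.3], toy_sentiment))
# Mostly positive reviews: no candidate is close enough, so the sketch abstains.
print(select_or_abstain(CANDIDATES, [0.8, 0.8, 0.7], toy_sentiment))
```

In this sketch the first call returns the mixed-sentiment candidate and the second returns `None`. A larger `tolerance` trades fewer abstentions for looser agreement with the aggregate.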

Paper Structure (12 sections, 9 figures, 13 tables)

This paper contains 12 sections, 9 figures, 13 tables.

Introduction
Synthesis and Summarization
Movie Reviews
Biomedical Systematic Reviews
Models
Experiments
Do Summarization Models Synthesize?
Sensitivity to Input Ordering
Sensitivity to Input Composition
Improving Synthesis in Summarization
Related Work
Conclusions

Figures (9)

Figure 1: Two multi-document summarization tasks where models must implicitly synthesize inputs to produce accurate summaries. Left: Summarizing film reviews with varying sentiment to yield a critics' consensus. Right: Summarizing trials that have evaluated a particular medical intervention.
Figure 2: Movie Reviews: Actual vs. Predicted Sentiments on generated summaries. Human outputs replace LED (upper left) for comparison.
Figure 3: The spread of sentiment/treatment effect measured in outputs produced from permuted input orderings. Left: Movie review sentiment. Right: Systematic review significance prediction entropy (0 indicates order insensitivity) on the subset of reviews that report significant effects.
Figure 4: ROUGE1 deltas from instance means for movie reviews (left) and systematic reviews (right).
Figure 5: Model sensitivity to manipulated input sentiment composition. Intensity patterns indicate that models oscillate between low and high sentiments in outputs, and are not responsive to subtler shifts in input sentiment. We show a model regression (blue) and the reference sensitivity regression (black).
...and 4 more figures
