Which Information Matters? Dissecting Human-written Multi-document Summaries with Partial Information Decomposition
Laura Mascarell, Yan L'Homme, Majed El Helou
TL;DR
The paper investigates what information makes human-written multi-document summaries high quality by applying Partial Information Decomposition (PID). Treating sentences as sources and summaries as targets, it uses a multivariate PID framework to decompose the information contributed by the sources into union, redundant, unique, and synergistic components. Empirical analysis across several multi-document summarization (MDS) datasets shows that redundancy decreases and unique information increases as the number of sources grows. The order of sources also matters: the first three documents often contribute the majority of the unique information. Surprisingly, synergy is negligible in typical MDS datasets but can be dominant in tasks like MultiRC when reframed for synergy analysis, which may signal joint information requirements or hallucination. The authors release Spider, a tool to quantify these information components, enabling more interpretable future MDS research and dataset construction aligned with human quality.
Abstract
Understanding the nature of high-quality summaries is crucial to further improve the performance of multi-document summarization (MDS). We propose an approach to characterize human-written summaries using partial information decomposition, which decomposes the mutual information provided by all source documents into union, redundancy, synergy, and unique information. Our empirical analysis on different MDS datasets shows that there is a direct dependency between the number of sources and their contribution to the summary.
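To make the decomposition concrete, the following is a minimal sketch for the simplest case of two discrete sources and one target. It is not the authors' Spider tool or their multivariate estimator. It assumes the minimum-mutual-information (MMI) redundancy measure as a simple stand-in, and the function names and toy distributions are hypothetical. The union term is left out because its exact definition depends on the PID framework used.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(joint, source_idx, target_idx):
    """Return I(X; T) in bits for a joint distribution over outcome tuples.

    joint maps tuples such as (x1, x2, t) to probabilities; source_idx
    selects which positions form X and target_idx selects T.
    """
    p_xt, p_x, p_t = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        x = tuple(outcome[i] for i in source_idx)
        t = outcome[target_idx]
        p_xt[(x, t)] += p
        p_x[x] += p
        p_t[t] += p
    return sum(
        p * log2(p / (p_x[x] * p_t[t])) for (x, t), p in p_xt.items() if p > 0
    )


def pid_two_sources(joint):
    """Two-source PID of I(X1, X2; T) under the MMI redundancy assumption."""
    i1 = mutual_information(joint, (0,), 2)
    i2 = mutual_information(joint, (1,), 2)
    total = mutual_information(joint, (0, 1), 2)
    redundancy = min(i1, i2)  # assumption: MMI redundancy, not the paper's measure
    unique1 = i1 - redundancy
    unique2 = i2 - redundancy
    # Whatever the joint information holds beyond the other three atoms.
    synergy = total - redundancy - unique1 - unique2
    return {
        "total": total,
        "redundancy": redundancy,
        "unique1": unique1,
        "unique2": unique2,
        "synergy": synergy,
    }


# Toy distributions (hypothetical): the target is the XOR of the two sources,
# or both sources are identical copies of the target.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
copy = {(a, a, a): 0.5 for a in (0, 1)}

if __name__ == "__main__":
    print("XOR :", pid_two_sources(xor))
    print("COPY:", pid_two_sources(copy))
```

In the XOR case neither source alone says anything about the target, so the full bit of information is synergy. In the copy case both sources carry the same bit, so it is entirely redundant. These two extremes mirror the contrast drawn above between synergy-dominated and redundancy-dominated data.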
