Event-Keyed Summarization
William Gantt, Alexander Martin, Pavlo Kuchmiichuk, Aaron Steven White
TL;DR
This work introduces event-keyed summarization (EKS), a task that generates a targeted summary of a specific event described in a document by combining event extraction with abstractive summarization. It presents MUCSUM, a dataset built on the classic MUC-4 template annotations, which supports evaluation of summaries that must fuse document context with an extracted event template; ablations confirm that both the document and the event template are necessary for high-quality, contextualized summaries. The authors benchmark fine-tuned pretrained language models (BART, T5, PEGASUS) and zero-shot prompting of larger models (ChatGPT, GPT-4), using ROUGE, BERTScore, CEAF-REE, and NLI-based metrics, complemented by human judgments. They find that fine-tuned models given the joint document-template input outperform their ablated counterparts, and that zero-shot prompting can yield reasonable summaries, albeit with lower overlap with the references; human evaluation affirms that the reference summaries remain superior. Overall, MUCSUM provides a robust, targeted benchmark for EKS and reveals meaningful interactions between input modality, model choice, and evaluation metric, with practical implications for producing event-centered summaries in information-seeking contexts.
Abstract
We introduce event-keyed summarization (EKS), a novel task that marries traditional summarization and document-level event extraction, with the goal of generating a contextualized summary for a specific event, given a document and an extracted event structure. We introduce a dataset for this task, MUCSUM, consisting of summaries of all events in the classic MUC-4 dataset, along with a set of baselines that comprises both pretrained LM standards in the summarization literature and larger frontier models. We show that ablations that reduce EKS to traditional summarization or structure-to-text yield inferior summaries of target events and that MUCSUM is a robust benchmark for this task. Lastly, we conduct a human evaluation of both reference and model summaries, and provide a detailed analysis of the results.
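To make the task setup concrete, the following is a minimal sketch of how the full EKS input and its two ablations (document only, as in traditional summarization; template only, as in structure-to-text) might be assembled for a sequence-to-sequence model. It is an illustration under stated assumptions, not the authors' preprocessing: the `EventTemplate` class, the `linearize_template` and `build_input` helpers, the role names, the `" | "` and `" [SEP] "` separators, and the example text are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EventTemplate:
    """Hypothetical container for one extracted event (not the MUCSUM schema)."""
    event_type: str
    # Maps a role name to the entity mentions that fill that role.
    roles: Dict[str, List[str]] = field(default_factory=dict)


def linearize_template(template: EventTemplate) -> str:
    """Flatten a template into one string, skipping roles with no fillers."""
    parts = [f"type: {template.event_type}"]
    for role, fillers in template.roles.items():
        if fillers:
            parts.append(f"{role}: {', '.join(fillers)}")
    return " | ".join(parts)


def build_input(document: str, template: EventTemplate,
                mode: str = "eks", sep: str = " [SEP] ") -> str:
    """Build the model input for the full task or for one of its ablations."""
    if mode == "eks":
        # Full task: the summary is keyed on the template and grounded in the document.
        return linearize_template(template) + sep + document
    if mode == "document_only":
        # Ablation that reduces EKS to traditional summarization.
        return document
    if mode == "template_only":
        # Ablation that reduces EKS to structure-to-text generation.
        return linearize_template(template)
    raise ValueError(f"unknown mode: {mode}")


# Illustrative example (invented text, not drawn from MUC-4).
template = EventTemplate(
    event_type="bombing",
    roles={"Target": ["embassy"], "Weapon": [], "Victim": ["two guards"]},
)
document = "A bomb exploded outside the embassy, injuring two guards."

print(build_input(document, template, mode="eks"))
```

In this sketch the template is placed before the document so that the target event is stated up front; that ordering is a design choice of the illustration and carries no claim about the configuration used in the paper.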
