AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization
Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, Eugene Ie
TL;DR
AQuaMuSe tackles the scarcity of query-based multi-document summarization data by automatically mining long-form, multi-document targets from QA datasets and web corpora. The approach leverages sentence embeddings to align long-form answers with passages from Common Crawl, producing a dual dataset suitable for abstractive and extractive qMDS, with configurable thresholds and top-K retrieval. The authors release a 5,519-example dataset and validate it through baseline abstractive and extractive experiments, complemented by human evaluation that demonstrates both the quality and the headroom for improvement. This work provides a scalable, reusable framework that can extend to other QA sources and web corpora, potentially accelerating progress in query-conditioned multi-document summarization research.
Abstract
Summarization is the task of compressing source document(s) into coherent and succinct passages. It is a valuable tool for presenting users with a concise and accurate sketch of the top-ranked documents related to their queries. Query-based multi-document summarization (qMDS) addresses this pervasive need, but research is severely limited by a lack of training and evaluation datasets, as existing single-document and multi-document summarization datasets are inadequate in form and scale. We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora. Our approach is unique in that it can generate a dual dataset, for both extractive and abstractive summaries. We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl. Extensive evaluation of the dataset, along with baseline summarization model experiments, is provided.
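To make the mining step concrete, the following is a minimal sketch of the idea summarized in the TL;DR: score candidate passages against the sentences of a long-form answer, then keep the top-K passages that clear a similarity threshold. It is an illustration under stated assumptions, not the authors' pipeline. The function names (`embed`, `cosine`, `retrieve`), the bag-of-words vector used in place of a learned sentence encoder, and the values of `k` and `threshold` are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    Assumption: a real system would use a learned sentence embedding here.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(answer_sentences, passages, k=2, threshold=0.3):
    """For each answer sentence, keep up to k passages scoring >= threshold.

    Returns the selected passages in order of first selection, deduplicated.
    """
    selected = []
    for sentence in answer_sentences:
        s_vec = embed(sentence)
        scored = [(cosine(s_vec, embed(p)), i) for i, p in enumerate(passages)]
        # Highest similarity first; ties broken by passage index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for score, i in scored[:k]:
            if score >= threshold and i not in selected:
                selected.append(i)
    return [passages[i] for i in selected]


if __name__ == "__main__":
    answer = ["solar panels convert sunlight into electricity"]
    corpus = [
        "solar panels convert sunlight into electricity using cells",
        "the stock market fell sharply today",
        "sunlight is converted into electricity by panels",
    ]
    for passage in retrieve(answer, corpus, k=2, threshold=0.3):
        print(passage)
```

Raising `threshold` or lowering `k` in this sketch yields fewer, more closely matched input passages per answer, which mirrors the configurable thresholds and top-K retrieval mentioned in the TL;DR.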
