From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization
Botond Barta, Dorina Lakatos, Attila Nagy, Milán Konor Nyist, Judit Ács
TL;DR
The paper presents HunSum-2, an open-source Hungarian corpus for both abstractive and extractive summarization, built from cleaned Common Crawl segments covering 27 Hungarian news sites. The authors describe a preprocessing and deduplication pipeline, derive sentence-level extractive labels from sentence embeddings, and train baselines: mT5 and a huBERT-based Bert2Bert model for abstractive summarization, and a BertSum-based model for extractive summarization. In quantitative evaluation the extractive models score higher on ROUGE and BERTScore than the abstractive baselines; qualitative evaluation reveals factuality concerns for abstractive outputs and better factual alignment for extractive ones. The dataset, models, and code are released to support reproducibility and benchmarking on a low-resource language, with future work aimed at improving factual correctness and cross-domain applicability in Hungarian NLP.
Abstract
Training summarization models requires substantial amounts of training data. However, for lower-resource languages such as Hungarian, openly available models and datasets are notably scarce. To address this gap, our paper introduces HunSum-2, an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl corpus, which undergo thorough cleaning, preprocessing, and deduplication. In addition to abstractive summarization, we generate sentence-level labels for extractive summarization using sentence similarity. We train baseline models for both extractive and abstractive summarization on the collected dataset. To demonstrate the effectiveness of the trained models, we perform both quantitative and qualitative evaluation. Our dataset, models, and code are publicly available, encouraging replication, further research, and real-world applications across various domains.
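To make the idea of similarity-based extractive labeling concrete, here is a minimal sketch, not the authors' implementation. It assumes that each reference-summary sentence marks its most similar article sentence as positive; the exact selection rule used for HunSum-2 is not stated in this abstract. The `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption: a real pipeline would call a sentence-embedding model here.
    """
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extractive_labels(article_sentences, summary_sentences):
    """Return one 0/1 label per article sentence.

    Assumed rule: for each summary sentence, the single most similar
    article sentence receives label 1.
    """
    labels = [0] * len(article_sentences)
    if not article_sentences:
        return labels
    article_vecs = [embed(s) for s in article_sentences]
    for summary in summary_sentences:
        query = embed(summary)
        scores = [cosine(query, vec) for vec in article_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels[best] = 1
    return labels


article = [
    "The government announced a new budget on Monday.",
    "Weather was sunny in Budapest.",
    "Opposition parties criticised the budget plan.",
]
summary = ["New budget announced by the government."]
print(extractive_labels(article, summary))  # [1, 0, 0]
```

Swapping `embed` for a multilingual or Hungarian sentence encoder would change the similarity scores but leave the labeling loop unchanged, which is why the sketch keeps the two steps separate.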
