Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles

Samia Touileb, Vladislav Mikhailov, Marie Kroka, Lilja Øvrelid, Erik Velldal

TL;DR

This paper introduces an open benchmark dataset for Norwegian abstractive summarisation, with three human-authored summaries per news article in both Bokmål and Nynorsk. The articles are drawn from the EDEN corpus and summarised by three domain-expert Norwegian speakers in a two-round process that includes translating the summaries between the two written variants. The authors evaluate nine open-source Norwegian LLMs in a zero-shot setting using the noreval framework with six prompts per variant, and report ROUGE-L and BERTScore alongside a manual human evaluation. The results show that the task is challenging for current models: the Viking models perform best among the open models evaluated, yet human summaries remain superior in relevance and quality. The work establishes a reproducible, high-quality benchmark for Norwegian summarisation research and invites future model development to close the gap to human-level performance.

Abstract

We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian. The dataset is intended for benchmarking the abstractive summarisation capabilities of generative language models. Each document in the dataset is provided with three different candidate gold-standard summaries written by native Norwegian speakers, and all summaries are provided in both of the written variants of Norwegian -- Bokmål and Nynorsk. The paper describes the data creation effort as well as an evaluation of existing open LLMs for Norwegian on the dataset. We also provide insights from a manual human evaluation, comparing human-authored to model-generated summaries. Our results indicate that the dataset provides a challenging LLM benchmark for Norwegian summarisation capabilities.
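
Because each article comes with three candidate gold-standard summaries, a lexical-overlap metric such as ROUGE-L has to be aggregated over several references. The following is a minimal sketch of one way to do that, not the paper's evaluation code: it assumes whitespace tokenisation, lowercasing, and taking the maximum F1 over the references, none of which is confirmed by the text above. The function names are hypothetical.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between one candidate and one reference summary.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_ref_rouge_l(candidate: str, references: List[str]) -> float:
    """Best ROUGE-L F1 over all references (assumed aggregation: max)."""
    return max((rouge_l_f1(candidate, r) for r in references), default=0.0)


if __name__ == "__main__":
    # Toy strings for illustration only; they are not taken from the dataset.
    refs = [
        "regjeringen la fram statsbudsjettet i dag",
        "statsbudsjettet ble lagt fram i dag",
        "i dag kom statsbudsjettet",
    ]
    print(multi_ref_rouge_l("regjeringen la fram budsjettet i dag", refs))
```

Taking the maximum rewards a model summary for matching any one of the human summaries; averaging over the references is an equally plausible choice and would give lower scores when the annotators phrase things differently.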

Paper Structure (22 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Box plots of summary lengths authored by three different annotators (referred to as A1, A2, and A3) in either Bokmål (BM) or Nynorsk (NN).
  • Figure 2: Screenshot of the interface used during human evaluation. A news article is shown at the top, followed by two candidate summaries. The evaluator chooses the summary they prefer based on simple criteria (see the section on human evaluation).