NoticIA: A Clickbait Article Summarization Dataset in Spanish

Iker García-Ferrero, Begoña Altuna

TL;DR

NoticIA targets the summarization of Spanish clickbait news. It provides 850 triplets, each consisting of a headline, an article body, and a human-written ultrasummary (a single-sentence summary). The authors evaluate a broad range of instruction-tuned LLMs in zero-shot settings and show that task-specific fine-tuning, which yields the ClickbaitFighter model, reaches near-human performance with relatively small models. In zero-shot scenarios, pretraining data quality and instruction-following ability matter more than parameter count, and a compact 2B model can outperform many baselines once specialized. The dataset serves as a Spanish NLP benchmark for information extraction and retrieval tasks and supports the development of specialized summarization tools for Spanish-language media.

Abstract

We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative summarizations written by humans. This task demands advanced text understanding and summarization abilities, challenging the models' capacity to infer and connect diverse pieces of information to meet the user's informational needs generated by the clickbait headline. We evaluate the Spanish text comprehension capabilities of a wide range of state-of-the-art large language models. Additionally, we use the dataset to train ClickbaitFighter, a task-specific model that achieves near-human performance in this task.

Paper Structure (26 sections, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Examples of clickbait headlines from NoticIA. The headline is followed by a long article in which the answer to the headline is located at the end of the article. We translated two examples into English for illustration.
  • Figure 2: Category of the articles in the dataset.
  • Figure 3: Input prompt used to generate summaries. The prompt defines the task and guidelines.
  • Figure 4: ROUGE score and average summary lengths for all models evaluated in our dataset. The Y-axis represents the ROUGE score, while the X-axis indicates the average number of words in the summaries. A higher ROUGE score and a shorter summary length are considered optimal.
  • Figure 5: ROUGE scores of the models when evaluated against the gold summaries and the validation summaries produced by the second annotator.
  • ...and 3 more figures
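
The two quantities plotted in Figure 4, a ROUGE score and the average summary length in words, can be illustrated with a minimal sketch. This excerpt does not state which ROUGE variant or tokenization the authors use, so the code below assumes ROUGE-1 F1 over lowercased, whitespace-split words. The function names and the example sentences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Assumption: lowercased whitespace tokenization, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def average_length(summaries: list[str]) -> float:
    """Mean number of whitespace-separated words per summary."""
    return sum(len(s.split()) for s in summaries) / len(summaries)


# Hypothetical gold/predicted pair for illustration only.
gold = "The concert was cancelled because of rain"
pred = "The concert was cancelled due to rain"
score = rouge1_f1(pred, gold)      # 5 shared unigrams out of 7 on each side
avg_len = average_length([pred])   # 7 words
```

Under these assumptions, a model that copies long stretches of the article would tend to raise recall while inflating the average length, which is why the figure reads the two axes together.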