SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels

Elena Shushkevich, Long Mai, Manuel V. Loureiro, Steven Derby, Tri Kurniawan Wijaya

TL;DR

This work tackles the challenge of news similarity detection beyond crude topic-based heuristics by introducing SPICED, a seven-topic, complexity-aware dataset derived from WikiNews. It combines SimHash-based candidate generation with SBERT refinement and expert annotation to produce 977 gold-standard similar-news pairs across 1,954 articles, organized into four complexity levels and 32 dataset variants. Benchmarking with MinHash, BERT, SBERT, and SimCSE shows that SBERT and SimCSE outperform traditional hash-based methods, and the inter-topic and intra-topic scenarios reveal the dataset's increased difficulty. The authors release the dataset publicly and envisage future multilingual expansion and cross-dataset comparisons (e.g., with SemEval-2022) to advance robust news similarity modeling in real-world, heterogeneous news ecosystems.

Abstract

The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: simple heuristics, such as whether both articles in a pair are about politics, can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics within narrower domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four different levels of complexity, specifically designed for the news similarity detection task. We benchmark the created datasets using MinHash, BERT, SBERT, and SimCSE models.
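To make the hash-based baseline concrete, the following is a minimal sketch of how a MinHash similarity score between two news texts can be computed. It is an illustration only, not the authors' benchmarking code: the 3-word shingles, the 128 hash functions, the BLAKE2b-based hashing, the 0.5 decision threshold, and all function names are assumptions chosen for this example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed choice)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token; each seed acts as one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """For each of num_perm hash functions, keep the minimum hash over the token set."""
    if not tokens:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_similar(text_a, text_b, threshold=0.5, k=3, num_perm=128):
    """Label a pair as similar if the estimated Jaccard score reaches the threshold.

    The threshold is a hypothetical value for illustration, not one from the paper.
    """
    sig_a = minhash_signature(shingles(text_a, k), num_perm)
    sig_b = minhash_signature(shingles(text_b, k), num_perm)
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

Because a score of this kind depends only on shared word sequences, two articles that report the same event in different wording can score low. This is consistent with the reported result that the embedding-based models (SBERT, SimCSE) outperform hash-based methods.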

Paper Structure (15 sections, 5 tables)