A Novel Method for News Article Event-Based Embedding

Koren Ishlach, Itzhak Ben-David, Michael Fire, Lior Rokach

TL;DR

The paper addresses a limitation of existing news embedding methods: they often ignore temporal and event-centric signals. It introduces a three-stage pipeline: (1) extract entities/themes/events from articles, (2) build time-aware embeddings via time-sliced co-occurrence learning using an adapted GloVe with a pooling mechanism, and (3) generate article embeddings through both SIF and a semi-supervised Siamese network, then fuse them by concatenation. Key contributions include large-scale time-relevant entity/theme embeddings, a dual-embedding strategy whose two outputs are fused by concatenation, and a novel evaluation framework based on pairwise common-event attribution, all demonstrated on over 850,000 articles and 1,000,000 events from GDELT. The results show that the concatenated embeddings outperform the SIF-only and Siamese-only baselines in both daily and monthly settings, offering a scalable, CPU-friendly approach to event-centric news analysis with practical implications for bias detection, fake-news identification, and recommendations.

Abstract

Embedding news articles is a crucial tool for multiple fields, such as media bias detection, identifying fake news, and making news recommendations. However, existing news embedding methods are not optimized to capture the latent context of news events. Most embedding methods rely on full-text information and neglect time-relevant embedding generation. In this paper, we propose a novel lightweight method that optimizes news embedding generation by focusing on entities and themes mentioned in articles and their historical connections to specific events. We suggest a method composed of three stages. First, we process and extract events, entities, and themes from the given news articles. Second, we generate periodic time embeddings for themes and entities by training time-separated GloVe models on current and historical data. Lastly, we concatenate the news embeddings generated by two distinct approaches: Smooth Inverse Frequency (SIF) for article-level vectors and Siamese Neural Networks for embeddings with nuanced event-related information. We leveraged over 850,000 news articles and 1,000,000 events from the GDELT project to test and evaluate our method. We conducted a comparative analysis of different news embedding generation methods for validation. Our experiments demonstrate that our approach can both improve and outperform state-of-the-art methods on shared event detection tasks.
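The third stage can be illustrated with a minimal sketch, not the authors' implementation. It assumes each article is already reduced to a list of entity/theme tokens and that each token has a pretrained vector (for example, from the time-separated GloVe models). The function names `sif_weights`, `sif_embed`, and `fuse`, the smoothing constant `a = 1e-3`, and the toy tokens are all hypothetical. The sketch covers only the weighted-average part of SIF and omits the common-component removal step that the full SIF procedure applies afterwards.

```python
from collections import Counter
from typing import Dict, List, Sequence

Vector = List[float]


def sif_weights(docs: Sequence[Sequence[str]], a: float = 1e-3) -> Dict[str, float]:
    """Return the SIF weight a / (a + p(token)) for every token in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: a / (a + n / total) for tok, n in counts.items()}


def sif_embed(doc: Sequence[str], vectors: Dict[str, Vector],
              weights: Dict[str, float]) -> Vector:
    """Weighted average of the token vectors of one article."""
    dim = len(next(iter(vectors.values())))
    known = [tok for tok in doc if tok in vectors]
    if not known:
        return [0.0] * dim
    out = [0.0] * dim
    for tok in known:
        # A token unseen when the weights were built has p(token) = 0, so its weight is 1.
        w = weights.get(tok, 1.0)
        for i, x in enumerate(vectors[tok]):
            out[i] += w * x
    return [x / len(known) for x in out]


def fuse(sif_vec: Vector, siamese_vec: Vector) -> Vector:
    """Fuse two article embeddings by plain concatenation."""
    return list(sif_vec) + list(siamese_vec)


# Toy data: two articles, three entity/theme tokens, 2-d token vectors.
docs = [["PERSON_A", "THEME_ELECTION"], ["PERSON_A", "THEME_TRADE"]]
vectors = {
    "PERSON_A": [1.0, 0.0],
    "THEME_ELECTION": [0.0, 1.0],
    "THEME_TRADE": [1.0, 1.0],
}
weights = sif_weights(docs)
article_vec = sif_embed(docs[0], vectors, weights)
# The second argument stands in for the output of a trained Siamese encoder.
fused = fuse(article_vec, [0.5, -0.5, 0.25])
```

Frequent tokens such as `PERSON_A` receive a smaller weight than rare ones, so rare entities and themes dominate the article vector. Concatenation keeps both views intact and leaves any downstream model free to weight them.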

Paper Structure (39 sections, 5 equations, 12 figures, 6 tables)

This paper contains 39 sections, 5 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: This figure presents the method's entire pipeline of news embedding generation.
  • Figure 2: This figure contains a Pareto chart of the top 35 person occurrences in the full dataset collected from GDELT.
  • Figure 3: The architecture and training process of each Triplet Siamese Network.
  • Figure 4: This figure maps each monitored media source to the articles published by it in the preprocessed dataset. Conservative and Liberal media sources are labeled in blue and red, respectively.
  • Figure 5: Siamese-Network: Train Triplet Loss. The X-axis is the monitored training steps; for every 4 steps, the average loss was calculated. The labels represent each Siamese model that was trained for a given month.
  • ...and 7 more figures
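
The captions for Figures 3 and 5 refer to a triplet loss whose value is averaged over every 4 training steps. The following is a minimal sketch of the standard triplet loss and that kind of step averaging. The Euclidean distance, the margin of 1.0, and the names `triplet_loss` and `windowed_mean` are assumptions made for illustration; the captions do not specify the distance function or the margin used in the paper.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def windowed_mean(losses: Sequence[float], window: int = 4) -> List[float]:
    """Average the losses over consecutive, non-overlapping windows of steps."""
    chunks = [losses[i:i + window] for i in range(0, len(losses), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


# The positive coincides with the anchor and the negative is far away: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# The roles are swapped: the loss equals the distance gap plus the margin.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
# Eight per-step losses reduced to two plotted points.
curve = windowed_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

In this sketch, a triplet would consist of an anchor article, an article assumed to share an event with it (positive), and one assumed not to (negative); the loss reaches zero once the negative is at least `margin` farther from the anchor than the positive.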