Author Unknown: Evaluating Performance of Author Extraction Libraries on Global Online News Articles

Sriharsha Hatwar, Virginia Partridge, Rahul Bhargava, Fernando Bermejo

TL;DR

An evaluation of five existing software packages and one customized model for author extraction suggests that Go-readability and Trafilatura are the most consistent solutions, but every package produces highly variable results across languages.

Abstract

Analysis of large corpora of online news content requires robust validation of the underlying metadata extraction methodologies. Identifying the author of a web-based news article is one such extraction task, and it enables a range of research questions. While numerous off-the-shelf solutions for author extraction exist, little work compares their performance, especially in multilingual settings. In this paper we present a manually coded cross-lingual dataset of authors of online news articles and use it to evaluate the performance of five existing software packages and one customized model. Our evaluation shows evidence for Go-readability and Trafilatura as the most consistent solutions for author extraction, but we find that all packages produce highly variable results across languages. These findings are relevant for researchers wishing to use author data in their analysis pipelines, and primarily indicate that further validation for specific languages and geographies is required before the results can be relied upon.

Paper Structure

This paper contains 22 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Annotation interface in LabelStudio showing an author annotated in an HTML document as it would appear on the original website.
  • Figure 2: ROUGE-1 scores for each library in each language. Higher is better. Dashes indicate instances of no character overlap.
  • Figure 3: Normalized edit distance scores for each library in each language. Lower is better.
  • Figure 4: ROUGE-L scores for each library in each language. Higher is better. Dashes indicate instances of no character overlap.
  • Figure 5: Radar plot of ROUGE-L scores.
  • ...and 1 more figure
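
As an illustration of the "lower is better" metric named in the Figure 3 caption, the following is a minimal sketch of a normalized edit distance between an extracted author string and a manually coded reference. It rests on assumptions that are not confirmed by the text above: character-level Levenshtein distance, divided by the length of the longer string. The paper's exact normalization may differ, and the function names `levenshtein` and `normalized_edit_distance` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string on the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Assumed normalization: divide by the longer string's length.

    Returns a value in [0, 1]; 0.0 means the two strings are identical.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are treated as a perfect match
    return levenshtein(predicted, reference) / longest


if __name__ == "__main__":
    # Hypothetical extractor outputs scored against a hypothetical gold author.
    gold = "Jane Doe"
    for extracted in ["Jane Doe", "By Jane Doe", ""]:
        print(repr(extracted), normalized_edit_distance(extracted, gold))
```

Under this assumed normalization, an exact match scores 0.0 and an empty extraction scores 1.0, so leftover byline text such as a leading "By " is penalized in proportion to its length.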