NewsHomepages: Homepage Layouts Capture Information Prioritization Decisions

Ben Welsh, Naitian Zhou, Arda Kaz, Michael Vu, Alexander Spangher

TL;DR

NewsHomepages introduces a large-scale dataset of over 3,000 news homepage layouts captured over three years to study information prioritization. It combines a weakly-supervised bounding-box bootstrap with pairwise article comparisons to infer editorial significance from layout cues such as size and position. The work presents two practical demonstrations: measuring cross-outlet agreement on newsworthiness, and surfacing newsworthy leads in a non-news corpus (San Francisco city policies) with LLM-based summaries, which highlights cross-domain transferability. Together, these results show that homepage editorial cues reflect latent organizational priorities and offer tools for journalists and researchers to analyze information prioritization at scale.

Abstract

Information prioritization plays an important role in how humans perceive and understand the world. Homepage layouts serve as a tangible proxy for this prioritization. In this work, we present NewsHomepages, a large dataset of over 3,000 news website homepages (including local, national and topic-specific outlets) captured twice daily over a three-year period. We develop models to perform pairwise comparisons between news items to infer their relative significance. To illustrate that modeling organizational hierarchies has broader implications, we apply our models to rank-order a collection of local city council policies passed over a ten-year period in San Francisco, assessing their "newsworthiness". Our findings lay the groundwork for leveraging implicit organizational cues to deepen our understanding of information prioritization.
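
As a rough illustration of how layout cues could be turned into pairwise comparisons, the sketch below derives (more prominent, less prominent) article pairs from bounding boxes using size and position. The `Box` class, the `prominence` score and its equal weighting, and the example coordinates are all hypothetical assumptions made for illustration; they are not the paper's model or data.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box for one homepage article (pixels)."""
    article_id: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def prominence(box: Box, page_width: float, page_height: float) -> float:
    """Toy score: larger, higher, and further-left boxes score higher.

    The equal weighting of the three cues is an illustrative assumption.
    """
    size = box.area / (page_width * page_height)
    top = 1.0 - box.y / page_height
    left = 1.0 - box.x / page_width
    return size + top + left


def pairwise_labels(boxes, page_width, page_height):
    """Return (winner_id, loser_id) for every pair with a strict ordering."""
    pairs = []
    for a, b in combinations(boxes, 2):
        score_a = prominence(a, page_width, page_height)
        score_b = prominence(b, page_width, page_height)
        if score_a == score_b:
            continue  # a tie carries no preference signal
        winner, loser = (a, b) if score_a > score_b else (b, a)
        pairs.append((winner.article_id, loser.article_id))
    return pairs


# Made-up layout: a large lead story, a smaller sidebar item, a footer link.
boxes = [
    Box("lead", x=0, y=0, width=800, height=400),
    Box("sidebar", x=800, y=0, width=400, height=200),
    Box("footer-link", x=0, y=1800, width=200, height=50),
]
labels = pairwise_labels(boxes, page_width=1200, page_height=2000)
print(labels)
```

Pairs of this kind could serve as weak supervision for a text-only comparison model, which is what would allow a ranking learned from homepages to be applied to documents that have no layout at all, such as city council policies.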

Paper Structure (33 sections, 8 figures, 8 tables)

This paper contains 33 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Three "newsworthiness" signals that editors use to guide reader attention are shown above. (1) Position (i.e. articles that are placed above, $\uparrow$, and left, $\leftarrow$, relative to other articles are more important [hays2018analysis]). (2) Size (i.e. articles that are larger than other articles are more important). (3) Graphics and Font (i.e. articles with graphics and images are more important). We release NewsHomepages, a large dataset of over 3,000 homepages, collected twice daily over three years, to study information prioritization in this setting. We show that we can model these decisions at scale and demonstrate the usefulness of these models on two downstream tasks.
  • Figure 2: Comparison of Kendall's $\tau$ rank correlation (on newsworthiness judgements) and SBERT cosine similarity (on articles) across news outlets.
  • Figure 3: We show three sections of a sample homepage (from CBS News) where editorial decisions are made for different reasons. We highlight the "Breaking News" section, "Section Fronts" and "The Footer".
  • Figure 4: Illustration of our deterministic bootstrapping algorithm and a failure case. Here, when non-article links exist, we misunderstand the full area of an article, excluding the text below.
  • Figure 5: Different analyses we run on bounding boxes across time: average locations of bounding boxes on a homepage, locations where articles are first added, locations where they are removed, and the average time articles spend in various locations.
  • ...and 3 more figures