Explaining Mixtures of Sources in News Articles

Alexander Spangher, James Youn, Matt DeButts, Nanyun Peng, Emilio Ferrara, Jonathan May

TL;DR

This work adapts five existing schemata and introduces three new ones to describe journalistic plans for the inclusion of sources in documents, and develops metrics to select the most likely plan, or schema, underlying a story, which are used to compare schemata.

Abstract

Human writers plan, then write. For large language models (LLMs) to play a role in longer-form article generation, we must understand the planning steps humans make before writing. We explore one kind of planning, source-selection in news, as a case study for evaluating plans in long-form generation. We ask: why do specific stories call for specific kinds of sources? We imagine a generative process for story writing in which a journalist first selects a source-selection schema and then chooses sources based on the categories in that schema. Learning the article's plan means predicting the schema initially chosen by the journalist. Working with professional journalists, we adapt five existing schemata and introduce three new ones to describe journalistic plans for the inclusion of sources in documents. Then, inspired by Bayesian latent-variable modeling, we develop metrics to select the most likely plan, or schema, underlying a story, which we use to compare schemata. We find that two schemata, stance and social affiliation, best explain source plans in most documents. However, other schemata, like textual entailment, explain source plans in factually rich topics like "Science". Finally, we find we can predict the most suitable schema given just the article's headline with reasonable accuracy. We see this as an important case study for human planning, one that provides a framework and approach for evaluating other kinds of plans. We release a corpus, NewsSources, with annotations for 4M articles.

Paper Structure

This paper contains 54 sections, 8 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: We seek to infer unobserved plans, or schemata, in natural data, focusing on one scenario: source-selection made by human journalists during news writing. Although the reasons why sources are chosen are unobservable, we show that one explanation (represented by squares in the diagram) is preferred over another (represented by circles) if it better predicts the observed text (conditional perplexity) and the explanation is more internally consistent (posterior predictive). Our paper is divided into two parts. In the first part (i.e. Section \ref{subsct:source_schemata} and Section \ref{sct:classifier-training}), we introduce the different schemata we will compare, i.e. the top half of this diagram. In the second part (i.e. Section \ref{sct:comparing_schemata} and Section \ref{sct:predicting_schemata}), we determine the right schema for a datum among competing schemata, i.e. the bottom half of this diagram, and we show that, given minimal information about a document, we can predict what schema should be used.
  • Figure 2: Label-sets for source-planning schemata. Extrinsic Source Schemata: Affiliation, Role and Retrieval-method \cite{spangher2023identifying} capture characteristics of sources extrinsic to their usage in the document. Functional Source Schemata: Argumentation \cite{al2016news}, Discourse \cite{choubey2020discourse} and Identity capture the functional narrative role of sources. Debate-Oriented Schemata: Natural Language Inference (NLI) \cite{dagan2005pascal} and Stance \cite{hardalov2021cross} capture the role of sources in encompassing multiple sides. The three novel schemata we introduce are shown with borders: Affiliation, Identity and Role. For definitions, see Appendix \ref{app:schema_definitions}.
  • Figure 3: Correlation between the 8 schemata, measured as Cramér's V \cite{cramer1999mathematical}, the effect-size measure for the $\chi^2$ test of independence.
  • Figure 4: The Stance and NLI schema definitions are not closely aligned. We show the conditional probability of labels in each schema, $p(x|y)$, where $x=$ Stance and $y=$ NLI.
  • Figure 5: Distribution of conditional perplexity measurements across different experimental groups.
  • ...and 3 more figures
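
The two statistics named in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. It assumes each source carries one label per schema, so two schemata yield two parallel label lists. The function names `cramers_v` and `conditional_prob` and the toy labels are hypothetical and not taken from the paper or from NewsSources. The sketch uses the plain (uncorrected) form of Cramér's V, and the paper may compute it differently.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two parallel lists of categorical labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed co-occurrence counts
    row = Counter(xs)              # marginal counts for the first schema
    col = Counter(ys)              # marginal counts for the second schema

    # Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a schema with a single observed label carries no association
    return math.sqrt(chi2 / (n * k))


def conditional_prob(xs, ys):
    """Estimate p(x | y) from counts, returned as {y: {x: probability}}."""
    joint = Counter(zip(xs, ys))
    col = Counter(ys)
    table = {}
    for (x, y), count in joint.items():
        table.setdefault(y, {})[x] = count / col[y]
    return table


# Toy labels (hypothetical): one label per source under each of two schemata.
schema_a = ["for", "against", "for", "against"]
schema_b = ["entail", "contradict", "entail", "contradict"]

v = cramers_v(schema_a, schema_b)
p_a_given_b = conditional_prob(schema_a, schema_b)
```

A value of V near 1 indicates that one schema's labels nearly determine the other's, while a value near 0 indicates that the two schemata carry largely independent information. The `conditional_prob` table is the quantity $p(x|y)$ described for Figure 4, estimated here from raw counts.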