Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia

Farhan Samir, Chan Young Park, Anjalie Field, Vered Shwartz, Yulia Tsvetkov

TL;DR

This paper introduces InfoGap, an efficient and reliable method for locating information gaps and inconsistencies between articles at the fact level, across languages. InfoGap supports large-scale analyses while also pinpointing local document- and fact-level information gaps, laying a foundation for targeted and nuanced comparative language analysis at scale.

Abstract

To explain social phenomena and identify systematic biases, much research in computational social science focuses on comparative text analyses. These studies often rely on coarse corpus-level statistics or local word-level analyses, mainly in English. We introduce the InfoGap method, an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level, across languages. We evaluate InfoGap by analyzing portrayals of LGBT people across 2.7K biography pages on the English, Russian, and French Wikipedias. We find large discrepancies in factual coverage across the languages. Moreover, our analysis reveals that biographical facts carrying negative connotations are more likely to be highlighted in Russian Wikipedia. Crucially, InfoGap both facilitates large-scale analyses and pinpoints local document- and fact-level information gaps, laying a new foundation for targeted and nuanced comparative language analysis at scale.

Paper Structure (48 sections, 2 theorems, 3 equations, 3 figures, 9 tables)


Key Result

Proposition 1

The probability of InfoGap making $k$ errors is $\leq \exp(-2(1-\epsilon)^2k)$, where $\epsilon$ is the error rate of the classifier when it predicts $F\not \vDash e_i$.
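To get a feel for how quickly this bound shrinks, the expression $\exp(-2(1-\epsilon)^2k)$ can be evaluated directly. The sketch below is only an illustration: the function name `infogap_error_bound` and the sample values of $k$ and $\epsilon$ are assumptions made for this example and do not come from the paper.

```python
import math


def infogap_error_bound(k: int, epsilon: float) -> float:
    """Evaluate exp(-2 * (1 - epsilon)^2 * k), the bound stated in Proposition 1.

    k       -- number of errors (non-negative integer)
    epsilon -- classifier error rate, assumed here to lie in [0, 1]
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    return math.exp(-2.0 * (1.0 - epsilon) ** 2 * k)


# Illustrative values only; these are not error rates reported in the paper.
for k in (1, 5, 10):
    for epsilon in (0.1, 0.3):
        print(f"k={k:2d}  epsilon={epsilon:.1f}  bound={infogap_error_bound(k, epsilon):.3e}")
```

For a fixed $\epsilon < 1$ the bound decays exponentially in $k$, and a smaller $\epsilon$ gives a faster decay.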

Figures (3)

  • Figure 1: We propose a method, InfoGap, to locate fact (mis)alignments in Wikipedia biographies in different language versions. InfoGap identifies facts that are common to a pair of articles ("Griner was born on October 18, 1990"), and facts unique to one language version ("Griner had recorded the sixth triple-double"; En only) enabling further analysis of information gaps, editors' selective preferences within articles, and analyses at scale across languages, cultures, and demographics.
  • Figure 2: Schematic of the InfoGap procedure. We describe the Fact Decomposition and Multilingual Alignment steps in §\ref{sec:x-fact-retrieve}, and the Alignment Verification step in §\ref{sec:x-fact-eq}.
  • Figure 3: Distribution of information overlaps for LGBTBioCorpus. Top: Distribution over the percentage of facts in En biographies also found in their Fr and Ru counterparts. Bottom: Distribution over the percentage of facts in Fr and Ru biographies also found in their English counterparts. $N=2,700$ biographies. In general, En biographies contain more facts that are exclusive to En.
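The per-biography quantity plotted in Figure 3, the percentage of facts in one language version that are also found in the other, can be sketched as follows. This is a minimal illustration that assumes each fact has already been given a boolean alignment label. The function name `overlap_percentage` and the sample labels are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def overlap_percentage(found_in_other: Sequence[bool]) -> float:
    """Percentage of an article's facts that were matched in the other language version."""
    if not found_in_other:
        raise ValueError("need at least one fact")
    return 100.0 * sum(found_in_other) / len(found_in_other)


# Hypothetical labels for the facts of one En biography: True means that some
# fact in the Fr version was judged to convey the same information.
en_facts_found_in_fr = [True, False, True, True, False]
print(overlap_percentage(en_facts_found_in_fr))  # 60.0
```

Computing this value once per biography and per direction (En to Fr, Fr to En, and so on) would give the kind of distribution the figure describes.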

Theorems & Definitions (3)

  • Proposition 1: Error Bound of Event Identification through InfoGap
  • proof
  • Theorem 1: Hoeffding's inequality
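For reference, a standard form of Hoeffding's inequality for independent random variables $X_1, \dots, X_k$ taking values in $[0, 1]$, with $S_k = \sum_{i=1}^{k} X_i$ and any $t > 0$, is

$$P\big(S_k - \mathbb{E}[S_k] \geq t\big) \leq \exp\left(-\frac{2t^2}{k}\right).$$

One way to see how a bound of the form in Proposition 1 can arise is sketched here. It is an assumption about the argument, not a reproduction of the paper's proof. Let $X_i$ indicate an error on the $i$-th prediction, with each error occurring independently with probability $\epsilon < 1$, so that $\mathbb{E}[S_k] = \epsilon k$. Making $k$ errors in $k$ predictions means $S_k = k$, which is a deviation of $t = (1-\epsilon)k$ above the mean, and therefore

$$P(S_k = k) \leq \exp\left(-\frac{2\big((1-\epsilon)k\big)^2}{k}\right) = \exp\big(-2(1-\epsilon)^2 k\big).$$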