Taboo and Collaborative Knowledge Production: Evidence from Wikipedia

Kaylea Champion, Benjamin Mako Hill

TL;DR

This study develops a Wiktionary-based method to identify taboo subjects and applies it to English Wikipedia to compare taboo and non-taboo articles. Using longitudinal Wikipedia data, the authors test five hypotheses about readership, contributions, quality, and identifiability, finding that taboo articles are more popular and higher quality yet attract more vandalism and low-quality edits. The results reveal a nuanced privacy-identity dynamic among contributors, with mixed support for identifiability concerns. The work offers design implications for privacy in peer-produced knowledge systems and provides a scalable approach to studying taboo across languages and platforms.

Abstract

By definition, people are reticent or even unwilling to talk about taboo subjects. Because subjects like sexuality, health, and violence are taboo in most cultures, important information on each of these subjects can be difficult to obtain. Are peer-produced knowledge bases like Wikipedia a promising approach for providing people with information on taboo subjects? With its reliance on volunteers who might also be averse to taboo, can the peer production model produce high-quality information on taboo subjects? In this paper, we seek to understand the role of taboo in knowledge bases produced by volunteers. We do so by developing a novel computational approach to identify taboo subjects and by using this method to identify a set of articles on taboo subjects in English Wikipedia. We find that articles on taboo subjects are more popular than non-taboo articles and that they are frequently vandalized. Despite frequent vandalism attacks, we also find that taboo articles are higher quality than non-taboo articles. We hypothesize that stigmatizing societal attitudes will lead contributors to taboo subjects to seek to be less identifiable. Although our results are consistent with this proposal in several ways, we surprisingly find that contributors make themselves more identifiable in others.

Paper Structure (27 sections, 5 figures, 4 tables)


Figures (5)

  • Figure 1: The Wiktionary definition of "member", which has meanings that range from group association to anatomical.
  • Figure 2: Our analytical pipeline first extracts n-grams, labeling them taboo if they are drawn from definitions tagged as euphemistic. Our samples are drawn from those articles that match these n-grams.
  • Figure 3: Boxplots showing the distributions of article-level variables for five hypothesis tests. From top to bottom: (a) view rank of articles, calculated within a given month across all articles so that the most viewed article ranks 1 (H1); (b) quantity of contributions (H2); (c) article-level revert rates (H3); (d) article-level damaging contribution rate (H3); and (e) quality of the articles (H4). Small vertical lines in the boxes indicate medians. Triangles are located at the mean.
  • Figure 4: Visualization of average article quality over time as predicted by the Wikimedia ORES API shown using generalized additive model (GAM) smoothers. We see that in the first several years of their existence, taboo subjects grow somewhat more quickly in quality, but that their quality growth over time begins to track more closely to the comparison set.
  • Figure 5: Average quality of the first version of new articles over time shown using generalized additive model (GAM) smoothers. The rug along the axes identifies the areas with the greatest concentration of data.
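
The pipeline described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record layout (`gloss`, `tags`), the function names, the exact-title matching rule, and the toy data are all assumptions made for illustration, and a real pipeline would presumably need extra steps such as stop-word filtering that the caption does not describe.

```python
from typing import Iterable


def ngrams(text: str, n_max: int = 2) -> set[str]:
    """Return all word n-grams of length 1..n_max from the lowercased text."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    }


def taboo_ngrams(definitions: Iterable[dict], n_max: int = 2) -> set[str]:
    """Collect n-grams drawn from definitions tagged as euphemistic."""
    found: set[str] = set()
    for definition in definitions:
        if "euphemistic" in definition["tags"]:
            found |= ngrams(definition["gloss"], n_max)
    return found


def matching_articles(titles: Iterable[str], taboo: set[str]) -> list[str]:
    """Keep article titles that exactly match a taboo n-gram (assumed rule)."""
    return [title for title in titles if title.lower() in taboo]


# Hypothetical toy data standing in for Wiktionary definitions and article titles.
definitions = [
    {"gloss": "sexual intercourse", "tags": ["euphemistic"]},
    {"gloss": "a piece of furniture", "tags": []},
]
titles = ["Sexual intercourse", "Furniture", "Intercourse"]

taboo = taboo_ngrams(definitions)
sample = matching_articles(titles, taboo)
print(sample)
```

Only the definition tagged as euphemistic contributes n-grams, so "Furniture" is excluded from the sample while both titles that match a taboo n-gram are kept.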