The Language of Approval: Identifying the Drivers of Positive Feedback Online

Agam Goyal, Charlotte Lambert, Eshwar Chandrasekharan

TL;DR

The paper investigates which linguistic attributes causally drive positive feedback on Reddit by analyzing 11 million posts across 100 subreddits. Using a selection-on-observables causal framework with risk-stratified matching and fixed effects, it isolates the impact of textual features on three reward signals (score, awards, and gold) while controlling for author reputation, timing, and community context. It then shows that the same features carry enough predictive power to surface desirable posts in real time, with local subreddit models outperforming a global model in many cases (mean local AUC $0.726$ versus $0.654$ for the global model). An audit against surveys and guidelines reveals a policy-practice gap: guidelines focus on civility and formatting rather than on the empirically supported linguistic strategies that improve reception. The work contributes a rich, causally informed feature set, actionable guidance for writing guidelines that teach users how to contribute well, and a framework for proactive moderation that emphasizes positive reinforcement over purely punitive strategies.

Abstract

Positive feedback via likes and awards is central to online governance, yet which attributes of users' posts elicit rewards -- and how these vary across authors and communities -- remains unclear. To examine this, we combine quasi-experimental causal inference with predictive modeling on 11M posts from 100 subreddits. We identify linguistic patterns and stylistic attributes causally linked to rewards, controlling for author reputation, timing, and community context. For example, overly complicated language, tentative style, and toxicity reduce rewards. We use our set of curated features to train models that can detect highly-upvoted posts with high AUC. Our audit of community guidelines highlights a "policy-practice gap" -- most rules focus primarily on civility and formatting requirements, with little emphasis on the attributes identified to drive positive feedback. These results inform the design of community guidelines, of support interfaces that teach users how to craft desirable contributions, and of moderation workflows that emphasize positive reinforcement over purely punitive enforcement.

Paper Structure

This paper contains 67 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of the causal inference framework for identifying linguistic drivers of positive feedback on Reddit. We analyze 11M posts from 100 subreddits (May-September 2020) using a selection-on-observables approach that isolates causal effects of linguistic attributes from confounding factors. The framework combines rich linguistic feature extraction (100+ attributes spanning LIWC categories, surface markers, semantic style dimensions, topics, and toxicity) with baseline covariates capturing author reputation and activity patterns from a 14-day pre-period. Matching based on subreddit-wise risk stratification creates balanced comparison groups, while fixed effects control for community norms and temporal variation. The logistic regression model estimates causal effects on three positive feedback outcomes: high scores (top $25\%$ within subreddit-month, $813\text{k}$ candidates), awards ($283\text{k}$ candidates), and gold ($54\text{k}$ candidates). Key causal findings include: discussion-generating posts have $43\%$ higher odds of high scores compared to narrow help requests; clear, readable writing increases odds by $40\%$; question-heavy framing reduces odds by $30\%$; and toxicity consistently decreases reception. Newcomers face a baseline disadvantage of $6\%$ but benefit disproportionately from clarity and future-focused language.
  • Figure 2: Standardized mean differences (SMD) for covariates in $\mathbf{Z}$, averaged across retained strata. Risk stratification generally reduces imbalance; where pre-stratification imbalance was already low, changes are small. For all outcomes, the mean post-stratification SMD is below the 0.30 threshold, indicating adequate balance.
  • Figure 3: (Left) Distribution of AUC scores for subreddit-specific (local) models trained on posts from specific subreddits; the red dashed line marks the global model's overall AUC (0.654), while the green dashed line marks the average AUC of local models (0.726). (Right) Per-subreddit AUC difference between local and global models ($\Delta = \text{Local AUC} - \text{Global AUC}$) evaluated on the test set of each subreddit: green bars to the right of $0$ indicate local models outperform the global model, red bars to the left of $0$ indicate the global model performs better. Only the top-10 gains and losses are shown.
  • Figure 4: Color-coded rules for the r/unpopularopinion and r/pcmasterrace subreddits. Most rules in each subreddit fall under the "Restrictions" category (yellow), which states what the user cannot do. Other rules include anti-toxicity guidelines (red) or post-formatting guidelines (blue). None of the rules, or their framing, show users how to make "desirable" contributions or teach them the values of the community.
  • Figure 5: Detailed description of a rule from r/pcmasterrace which teaches users how to post rule-adhering links to prevent removals by moderators. Similar exemplars for how to frame posts to align with the positive values and rewarded linguistic patterns of the community would help users make better contributions.
  • ...and 1 more figures
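
Two quantities in the captions above are easy to misread: the standardized mean difference (SMD) used to check covariate balance in Figure 2, and the percent change in odds quoted in Figure 1. The following is a minimal sketch of both, assuming the common pooled-standard-deviation form of the SMD and the usual exp(coefficient) reading of a logistic regression; the paper's exact formulas may differ. The function names and toy values are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2).

    Uses sample variances; this pooled form is an assumption, not
    necessarily the definition used in the paper.
    """
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0  # identical constant groups: treat as perfectly balanced
    return (mean(treated) - mean(control)) / pooled_sd


def percent_change_in_odds(beta):
    """Convert a logistic-regression coefficient into a percent change in odds."""
    return (math.exp(beta) - 1) * 100


# Hypothetical toy covariate values for one stratum (not data from the paper).
treated = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
smd = standardized_mean_difference(treated, control)

# A coefficient of log(1.43) corresponds to 43% higher odds.
pct = percent_change_in_odds(math.log(1.43))
```

In this toy case the SMD is 0.5, which would exceed the 0.30 balance threshold mentioned in Figure 2. Note also that "43% higher odds" is a statement about odds, not about probability; the corresponding change in probability depends on the baseline rate.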