"One-Size-Fits-All"? Examining Expectations around What Constitute "Fair" or "Good" NLG System Behaviors

Li Lucy, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, Alexandra Olteanu

TL;DR

This work interrogates what constitutes fair NLG behavior by contrasting invariance (uniform responses across social cues) with adaptation (context- or user-specific responses). It builds a methodological framework around five identity-perturbation case studies and combines grounded theory with crowdsourced vignette analyses across multiple NLG systems in email-reply tasks. The findings reveal context-dependent preferences: adaptation can enhance realism and accommodation but risks stereotyping, while invariance supports safety and prescriptivism but may erase valid distinctions. The paper advances a nuanced evaluation framework for NLG fairness and highlights the need for participatory design to navigate conflicting expectations and real-world harms.

Abstract

Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to behave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these case studies, we examine people's expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute "fair" or "good" NLG system behaviors.
Paper Structure (32 sections, 24 figures, 13 tables)

Figures (24)

  • Figure 1: Distributions of judges' responses to whether they generally believe that reply suggestions should adapt to a type of identity-related language feature.
  • Figure 2: Main body of task instructions and questions in CS1. Other case studies use a similar format.
  • Figure 3: Additional followup questions when at least one reply is deemed more usable. In this example, reply suggestion #1 is selected, so followup questions target the usability of reply suggestion #2.
  • Figure 4: Reasons judges marked the second reply as less usable or not usable in CS1. The second reply differs from the baseline reply option along the subcategory of reply behavior shown on the $y$-axis.
  • Figure 5: The six names tested during the crowdsourcing phase of CS1 evoke different levels of familiarity among judges. The $x$-axis binarizes responses so that Unfamiliar corresponds to responding Never seen it before, while Familiar corresponds to Somewhat or Extremely familiar.
  • ...and 19 more figures