Stop! In the Name of Flaws: Disentangling Personal Names and Sociodemographic Attributes in NLP
Vagrant Gautam, Arjun Subramonian, Anne Lauscher, Os Keyes
TL;DR
This paper addresses the methodological and ethical challenges of using personal names to infer sociodemographic attributes in NLP. It synthesizes interdisciplinary insights on naming, highlights validity concerns (e.g., ground-truth limitations, construct validity, selection bias), and delineates ethical risks (e.g., harms, unequal impacts, cultural insensitivity, power dynamics). It then offers guiding questions and normative recommendations for studying names with care for individuals and communities, emphasizing context, feasibility, and potential harms. The work argues against treating names as reliable proxies for identity and for approaches that center autonomy, justice, and reflexivity in NLP research and deployment, with the aim of improving validity and social responsibility in name-related NLP studies and applications.
Abstract
Personal names simultaneously differentiate individuals and categorize them in ways that are important in a given society. While the natural language processing community has thus associated personal names with sociodemographic characteristics in a variety of tasks, researchers have engaged to varying degrees with the established methodological problems in doing so. To guide future work that uses names and sociodemographic characteristics, we provide an overview of relevant research: first, we present an interdisciplinary background on names and naming. We then survey the issues inherent to associating names with sociodemographic attributes, covering problems of validity (e.g., systematic error, construct validity), as well as ethical concerns (e.g., harms, differential impact, cultural insensitivity). Finally, we provide guiding questions along with normative recommendations to avoid validity and ethical pitfalls when dealing with names and sociodemographic characteristics in natural language processing.
