Enriching Datasets with Demographics through Large Language Models: What's in a Name?

Khaled AlNuaimi; Gautier Marti; Mathieu Ravaut; Abdulla AlKetbi; Andreas Henschel; Raed Jaradat

TL;DR

Large Language Models (LLMs) used zero-shot can predict demographics from names as well as, if not better than, bespoke models trained on specialized data.

Abstract

Enriching datasets with demographic information inferred from names, such as gender, race, and age, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach that is robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that Large Language Models (LLMs) used zero-shot can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state of the art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.
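
To make the zero-shot setting concrete, the sketch below builds a prompt that contains instructions only (no labelled examples) and parses a model's reply defensively. It is an illustration under stated assumptions, not the paper's prompt or pipeline: the label set `GENDERS`, the prompt wording, and the helpers `build_prompt` and `parse_response` are all hypothetical, and the actual LLM call is deliberately left out.

```python
import json

# Hypothetical label set; the paper's exact classes and prompt wording are not given here.
GENDERS = ("female", "male", "unknown")


def build_prompt(full_name: str) -> str:
    """Build a zero-shot prompt: instructions only, no labelled examples."""
    return (
        "Given only the person's name, estimate their likely gender.\n"
        f"Allowed values: {', '.join(GENDERS)}.\n"
        'Answer with JSON only, e.g. {"gender": "unknown"}.\n'
        f"Name: {full_name}"
    )


def parse_response(raw: str) -> str:
    """Extract the predicted label, falling back to 'unknown' on malformed output."""
    try:
        # Tolerate extra text around the JSON object in the model's reply.
        start, end = raw.index("{"), raw.rindex("}") + 1
        label = str(json.loads(raw[start:end]).get("gender", "")).strip().lower()
    except ValueError:  # missing braces or invalid JSON
        return "unknown"
    return label if label in GENDERS else "unknown"
```

The string returned by `build_prompt` would be sent to whichever LLM is being evaluated, and `parse_response` maps the reply back to a fixed label so that predictions from different models stay comparable.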

Paper Structure (29 sections, 1 equation, 5 figures, 8 tables)

Introduction
Related Work
Predicting Demographic Attributes from Names
Existing Datasets and Their Limitations
Machine Learning Approaches
Novelty of Our Approach
Task
Experiments
Setup
Datasets
Data cleaning
LLMs
Inference
Evaluation
Results
...and 14 more sections

Figures (5)

Figure 1: Race distribution split by gender on the Florida Voters test set. Race is reduced from nine to five classes, as in prior work.
Figure 2: Nationality distribution on the Wikipedia test set. The distribution is long-tailed and skewed towards English-speaking countries and Europe. The top 30 nationalities displayed account for 87% of data points.
Figure 3: Comparison of actual vs. predicted birth dates (Claude-3.5-sonnet, Llama-3.1-8b) on Florida Voters.
Figure 4: Density of age prediction on the Hong Kong SFC professionals dataset, for four LLMs.
Figure 5: Hierarchical clustering of LLMs based on their agreement on predictions for the three datasets: Florida, Wikipedia, and HK SFC. Left to right: (a) Race, (b) Nationality (complex setup), (c) Ethnicity, and (d) Predicted Age agreement.
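
The agreement-based clustering described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes agreement is the fraction of identical predictions between two models, uses distance = 1 - agreement, and picks average linkage, none of which is specified in the caption. The function names are hypothetical.

```python
from itertools import combinations


def agreement(a, b):
    """Fraction of items on which two models give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("prediction lists must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_agreement(preds):
    """Average-linkage agglomerative clustering with distance = 1 - agreement.

    preds maps a model name to its list of predicted labels.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    # Pairwise distances between individual models.
    dist = {
        frozenset(pair): 1.0 - agreement(preds[pair[0]], preds[pair[1]])
        for pair in combinations(preds, 2)
    }

    def linkage(pair):
        # Mean distance over all cross-cluster model pairs.
        a, b = pair
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    clusters = [(name,) for name in preds]
    merges = []
    while len(clusters) > 1:
        # Merge the two closest clusters and record the step.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges
```

The returned merge list plays the role of a dendrogram: models that merge early at a small distance agree on most predictions.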
