The Echoes of the 'I': Tracing Identity with Demographically Enhanced Word Embeddings

Ivan Smirnov

TL;DR

This paper addresses the measurement of identity by adding socio-demographic information to word embeddings in order to analyze self-views in natural language data. The authors construct enhanced I-tokens $I_{g,a}$ by replacing first-person pronouns with gender-age identifiers (gender $g$, age $a$) and train a CBOW model to produce semi-static embeddings. They show that the first principal component of the I-token embeddings aligns with gender ($I_{\text{woman},*}$ vs. $I_{\text{man},*}$), that the second and third components correspond to younger and older ages, and that projection onto a gender-stereotype axis reproduces established gendered self-views and reveals age-related dynamics. The approach extends to other demographics and data sources and enables cross-group comparisons without partitioning the corpus, offering a practical tool for sociolinguistic research and computational social science.
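
The token replacement step summarized above can be illustrated with a minimal sketch. The pronoun list, the `I_<gender>_<age>` token format, and the `enhance` helper are assumptions made for illustration; the summary does not state which pronouns are replaced or how the identifiers are spelled.

```python
import re

# Hypothetical pronoun list: the source only says "first-person pronouns".
FIRST_PERSON = re.compile(r"\b(i|me|my|mine|myself)\b", re.IGNORECASE)


def enhance(text: str, gender: str, age: int) -> str:
    """Replace first-person pronouns with a single gender-age I-token."""
    token = f"I_{gender}_{age}"  # hypothetical token format
    return FIRST_PERSON.sub(token, text)


# Toy posts with author metadata (made up for this example).
posts = [
    ("I think my dog likes me", "woman", 25),
    ("Nobody told me about it", "man", 40),
]

# One token list per post, ready to be fed to a CBOW trainer.
corpus = [enhance(text, gender, age).split() for text, gender, age in posts]
print(corpus[0])
```

Because every post keeps its own I-token, all groups share one corpus and one embedding space, which is what allows cross-group comparison without partitioning the data. The resulting token lists could be passed to any CBOW implementation; contractions such as "I'm" are not handled specially in this sketch.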

Abstract

Identity is one of the most commonly studied constructs in social science. However, despite extensive theoretical work on identity, there remains a need for additional empirical data to validate and refine existing theories. This paper introduces a novel approach to studying identity by enhancing word embeddings with socio-demographic information. As a proof of concept, we demonstrate that our approach successfully reproduces and extends established findings regarding gendered self-views. Our methodology can be applied in a wide variety of settings, allowing researchers to tap into a vast pool of naturally occurring data, such as social media posts. Unlike similar methods already introduced in computer science, our approach allows for the study of differences between social groups. This could be particularly appealing to social scientists and may encourage the faster adoption of computational methods in the field.

Paper Structure (5 sections, 3 figures)

Introduction
Methods
Data & Model
Results
Discussion

Figures (3)

Figure 1: The structure of enhanced I-token embeddings. The first principal component extracted from embeddings of enhanced I-tokens corresponds to gender (a, b). Curiously, age is represented by two components: the second component corresponds to a younger age (a), while the third corresponds to an older age (b).
Figure 2: Projection of enhanced I-tokens onto the gender stereotype axis reproduces established findings on gendered self-views. $\text{I}_{\text{woman},*}$ tokens are closer to the women's pole of the axis, while $\text{I}_{\text{man},*}$ tokens are closer to the men's pole, and the distance between them is larger than chance alone would explain (a). The gap narrows with age as $\text{I}_{\text{man},*}$ tokens shift towards the center (b).
Figure 3: Robustness of the results with respect to model specification. We evaluated how strongly two point-biserial correlations depend on model specification: the correlation between gender and the first principal component extracted from enhanced I-tokens (orange), and the correlation between gender and the projection of I-tokens onto the gender stereotype axis (blue). We found that no training beyond one epoch is required to reproduce the results (a). We also found that any reasonable number of dimensions can be used (b). Finally, we found that 100 MB is a sufficient corpus size; for smaller corpora, performance drops for adjectives as they become too rare (c). The first principal component of enhanced I-tokens remains strongly associated with gender in all our experiments.
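
The projection and point-biserial correlation mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the axis is assumed to be the normalized difference between the mean vectors of two word sets, the projection is assumed to be a cosine-style dot product, and all vectors are made-up 3-dimensional toy values that do not reflect the paper's results.

```python
from math import sqrt


def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(v, axis):
    """Dot product of the normalized vector with a unit-length axis."""
    return sum(a * b for a, b in zip(unit(v), axis))


def point_biserial(binary, values):
    """Pearson correlation between a 0/1 variable and a continuous one."""
    n = len(values)
    mb, mv = sum(binary) / n, sum(values) / n
    cov = sum((b - mb) * (x - mv) for b, x in zip(binary, values))
    sb = sqrt(sum((b - mb) ** 2 for b in binary))
    sv = sqrt(sum((x - mv) ** 2 for x in values))
    return cov / (sb * sv)


# Toy embeddings of words at each pole (assumption: made-up values).
women_pole = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
men_pole = [[-1.0, 0.1, 0.0], [-0.8, 0.0, 0.2]]

# Assumed axis construction: women's pole mean minus men's pole mean.
axis = unit([w - m for w, m in zip(mean_vec(women_pole), mean_vec(men_pole))])

# Toy I-token embeddings keyed by (gender, age).
i_tokens = {
    ("woman", 20): [0.6, 0.3, 0.1],
    ("woman", 50): [0.5, 0.1, 0.4],
    ("man", 20): [-0.7, 0.3, 0.1],
    ("man", 50): [-0.2, 0.1, 0.4],
}

scores = {key: project(vec, axis) for key, vec in i_tokens.items()}
is_woman = [1 if gender == "woman" else 0 for gender, _ in i_tokens]
r = point_biserial(is_woman, list(scores.values()))
print(scores, r)
```

Positive scores lie towards the women's pole and negative scores towards the men's pole, so comparing scores across ages within each gender gives the kind of gap shown in Figure 2. The principal component analysis from Figure 1 is not covered here.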
