The Echoes of the 'I': Tracing Identity with Demographically Enhanced Word Embeddings
Ivan Smirnov
TL;DR
This paper proposes measuring identity by adding socio-demographic information to word embeddings, so that self-views can be analyzed in natural language data. The authors introduce enhanced I-tokens $I_{g,a}$, which replace first-person pronouns with gender-age identifiers, and train a CBOW model to produce semi-static embeddings. They show that the first principal component aligns with gender ($I_{\text{woman},*}$ vs. $I_{\text{man},*}$), and that projection onto a gender-stereotype axis reproduces established gendered self-views and reveals age-related dynamics, with components corresponding to younger and older ages. The approach extends to other demographics and data sources and enables cross-group comparisons without partitioning the corpus, offering a practical tool for sociolinguistic research and computational social science.
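The I-token substitution can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the token format `I_<gender>_<age>`, the list of first-person forms, the tokenizer, and the sample posts are all assumptions made for illustration.

```python
import re

# Assumption: which first-person forms get replaced is illustrative only;
# contractions such as "i'm" are left untouched by this simple tokenizer.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

TOKEN_RE = re.compile(r"[a-z']+")


def i_token(gender: str, age: int) -> str:
    """Build a hypothetical I-token string for a (gender, age) pair."""
    return f"I_{gender}_{age}"


def enhance(text: str, gender: str, age: int) -> list[str]:
    """Lowercase and tokenize text, replacing first-person forms
    with the I-token of the author's demographic group."""
    tag = i_token(gender, age)
    tokens = TOKEN_RE.findall(text.lower())
    return [tag if tok in FIRST_PERSON else tok for tok in tokens]


# Made-up posts with author metadata: (text, gender, age).
posts = [
    ("I love my garden", "woman", 34),
    ("Nobody asked me", "man", 19),
]

# One token list per post; all groups stay in a single corpus.
corpus = [enhance(text, gender, age) for text, gender, age in posts]
print(corpus)
```

Because every group keeps its own token inside one shared corpus, a single embedding model can place all I-tokens in the same vector space. Token lists of this shape could then be passed to any CBOW trainer, for example gensim's `Word2Vec` with `sg=0`.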
Abstract
Identity is one of the most commonly studied constructs in social science. However, despite extensive theoretical work on identity, there remains a need for additional empirical data to validate and refine existing theories. This paper introduces a novel approach to studying identity by enhancing word embeddings with socio-demographic information. As a proof of concept, we demonstrate that our approach successfully reproduces and extends established findings regarding gendered self-views. Our methodology can be applied in a wide variety of settings, allowing researchers to tap into a vast pool of naturally occurring data, such as social media posts. Unlike similar methods already introduced in computer science, our approach allows for the study of differences between social groups. This could be particularly appealing to social scientists and may encourage the faster adoption of computational methods in the field.
