Word Embedding for Social Sciences: An Interdisciplinary Survey
Akira Matsui, Emilio Ferrara
TL;DR
The paper tackles the fragmentation of word-embedding literature across social sciences and proposes an integrative survey with a taxonomy centered on word2vec applications. It surveys diverse studies, categorizing them by research topics and nine analysis-method labels, including Pre-trained models, Overfitting, Working Variables, Reference Words, and Non-text data usage, with a mathematical grounding in SGNS learning. A representative simple experiment demonstrates that cosine similarity and Euclidean distance can yield different results, underscoring the importance of metric choice. The work highlights a shift toward non-text data and emphasizes cross-disciplinary communication to improve methodological clarity and transferability. Together, the taxonomy and empirical insights provide practical guidance for social scientists applying word embeddings and for method developers refining alignment and interpretation across domains.
Abstract
To extract essential information from complex data, computer scientists have been developing machine learning models that learn low-dimensional representations. Social scientists, not only computer scientists, have benefited from these advances in their own research, because human behavior and social phenomena are embedded in complex data. However, this emerging trend is not well documented: different social science fields rarely cover each other's work, which leaves knowledge in the literature fragmented. To document this trend, we survey recent studies that apply word embedding techniques to human behavior mining. We build a taxonomy of the methods and procedures used in the surveyed papers, helping social science researchers place their work within the literature on word embedding applications. We also conduct a simple experiment to warn that similarity measures commonly used in the literature can yield different results for individual comparisons even when they return consistent results at an aggregate level.
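The warning about similarity measures can be illustrated with a minimal sketch. The code below is not the survey's experiment: the two-dimensional vectors and the names `word_a`, `word_b`, and `query` are hypothetical values chosen only to show that cosine similarity and Euclidean distance can disagree about which vector is nearest when vector norms differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (ignores vector length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v (sensitive to vector length)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy "embeddings", not taken from any trained model.
# word_a points almost in the same direction as the query but has a large norm;
# word_b lies at a 45-degree angle to the query but sits close to it.
toy_vectors = {"word_a": [10.0, 1.0], "word_b": [0.5, 0.5]}
query = [1.0, 0.0]

# Nearest neighbour under each measure: highest similarity vs. smallest distance.
nearest_by_cosine = max(
    toy_vectors, key=lambda w: cosine_similarity(query, toy_vectors[w])
)
nearest_by_euclidean = min(
    toy_vectors, key=lambda w: euclidean_distance(query, toy_vectors[w])
)

print("cosine picks:", nearest_by_cosine)
print("euclidean picks:", nearest_by_euclidean)
```

In this toy setting cosine similarity selects `word_a` while Euclidean distance selects `word_b`. If all vectors were first normalized to unit length, the two measures would produce the same ranking, which is one reason the choice of measure and preprocessing is worth reporting explicitly.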
