Word Embeddings Track Social Group Changes Across 70 Years in China

Yuxi Ma, Yongqian Peng, Yixin Zhu

TL;DR

This study asks how official Chinese discourse encoded social groups between 1950 and 2019 and whether these representations differ from Western patterns, especially during radical social transformations. It combines diachronic word embeddings trained on the People's Daily (at annual and decade resolutions) with a secondary Google Books Chinese corpus to quantify group-trait associations via MAC and DiffMAC, supplemented by an event-centric WEAT framework. The authors report persistent valence asymmetries for ethnicity, age, and body type, whereas representations of gender and economic status undergo dramatic reversals linked to historical events such as the Cultural Revolution and the post-1978 reforms. The work provides a non-Western perspective on how state-sanctioned language encodes social structure, offers methodological innovations for temporal linguistic analysis, and highlights the complex interplay between ideology and social change in China.

Abstract

Language encodes societal beliefs about social groups through word patterns. While computational methods like word embeddings enable quantitative analysis of these patterns, studies have primarily examined gradual shifts in Western contexts. We present the first large-scale computational analysis of Chinese state-controlled media (1950-2019) to examine how revolutionary social transformations are reflected in official linguistic representations of social groups. Using diachronic word embeddings at multiple temporal resolutions, we find that Chinese representations differ significantly from Western counterparts, particularly regarding economic status, ethnicity, and gender. These representations show distinct evolutionary dynamics: while stereotypes of ethnicity, age, and body type remain remarkably stable across political upheavals, representations of gender and economic classes undergo dramatic shifts tracking historical transformations. This work advances our understanding of how officially sanctioned discourse encodes social structure through language while highlighting the importance of non-Western perspectives in computational social science.
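
The group-trait association measures named in the TL;DR can be illustrated with a small sketch. The summary does not define MAC or DiffMAC, so the code below assumes that MAC is the mean cosine similarity between a trait vector and the word vectors of a group, and that DiffMAC is the difference between two groups' MAC scores. The function names and the toy two-dimensional vectors are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mac(group_vecs, trait_vec):
    """ASSUMED definition: mean cosine similarity between one trait vector
    and every word vector representing a social group."""
    return float(np.mean([cosine(g, trait_vec) for g in group_vecs]))


def diff_mac(group_a_vecs, group_b_vecs, trait_vec):
    """ASSUMED definition: how much more strongly the trait is associated
    with group A than with group B."""
    return mac(group_a_vecs, trait_vec) - mac(group_b_vecs, trait_vec)


# Toy embeddings (hypothetical): group A aligns with the trait, group B is orthogonal.
group_a = [[1.0, 0.0], [2.0, 0.0]]
group_b = [[0.0, 1.0]]
trait = [1.0, 0.0]

score_a = mac(group_a, trait)
score_b = mac(group_b, trait)
difference = diff_mac(group_a, group_b, trait)
```

In a diachronic setting, the same computation would be repeated with the embeddings trained for each year or decade, giving one score per group, trait, and time slice.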

Paper Structure

This paper contains 24 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Corpus size comparison across time. We compare word counts between two Chinese language corpora from 1950 to 2019. The People's Daily corpus (red) maintains relatively stable coverage at approximately 10 million words per year throughout the period. The Google Books corpus (blue) shows temporal variation and particularly sparse coverage before 1980, with word counts fluctuating between 100 thousand and 100 million words per year.
  • Figure 2: Temporal evolution of Chinese social group representations (1950-2019). (a) Mean valence scores show persistent patterns across decades, with consistently higher valence for young (vs. elderly), ethnic minority (vs. Han Chinese), and thin (vs. fat) groups. (b) Mean valence scores demonstrate dramatic shifts and reversals for gender and economic status groups, including the 1970s gender valence reversal and the post-1978 transition in poor-rich dynamics.
  • Figure 3: Year-to-year correlation matrices revealing temporal dynamics in social representations (1950-2019). Each heatmap displays correlation coefficients between trait associations across years (red: strong positive; blue: weaker correlations; range: $0$ to $1.0$). Notable observations include (i) strong year-to-year stability near the diagonal ($r>0.8$); (ii) a distinct band of lower correlations during 1966-1976 ($r<0.4$), particularly evident in the "Young" and "Minorities" matrices; and (iii) varying impacts across groups during this period, with "Woman" and "Poor" showing relatively higher stability. Since paired groups under the same social category demonstrated similar correlation patterns, we present one representative plot from each category. (Vector graphics; zoom for details.)
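
A year-to-year correlation matrix of the kind described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: each year is represented by a vector of trait-association scores for one group, and the entries of the matrix are Pearson correlations between those vectors. The caption does not specify the correlation measure, and the data here are synthetic, so the function name, the matrix layout, and the numbers are all hypothetical.

```python
import numpy as np


def year_to_year_correlation(assoc):
    """Correlate each year's trait-association profile with every other year's.

    assoc: array of shape (n_years, n_traits); row i holds one group's
    association scores with every trait in year i.
    Returns an (n_years, n_years) matrix of Pearson correlations
    (ASSUMED measure; the caption does not name one).
    """
    assoc = np.asarray(assoc, dtype=float)
    return np.corrcoef(assoc)  # rows are treated as variables, i.e. years


# Synthetic data (not from the paper): a shared trait profile plus noise
# that grows with distance from the first year.
rng = np.random.default_rng(0)
years = list(range(1950, 1955))
n_traits = 50
base = rng.normal(size=n_traits)
assoc = np.stack(
    [base + 0.1 * i * rng.normal(size=n_traits) for i in range(len(years))]
)

corr = year_to_year_correlation(assoc)
```

Plotting `corr` as a heatmap with years on both axes would give one panel of the kind the caption describes, and a period of disrupted associations would appear as a band of lower values away from the diagonal.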