Generative AI as a Linguistic Equalizer in Global Science

Dragan Filimonovic, Christian Rutzer, Jeffrey Macher, Rolf Weder

TL;DR

The study addresses English dominance in global science by testing whether generative AI (GenAI) acts as a linguistic equalizer. It analyzes 5.65 million English-language publications by authors in non-English-speaking countries (2021–2024), flags GenAI-assisted work via field-specific lexical markers, and measures linguistic proximity to a U.S. English benchmark using SciBERT embeddings in an event-study Difference-in-Differences framework. The authors find that, after 2022, GenAI-assisted writing from these countries converges toward U.S. scientific English; the effect is strongest for linguistically distant countries and domestically coauthored teams, and it appears across multiple fields. These findings imply that GenAI can broaden global participation in science, with implications for editorial practice and research equity, while underscoring the need for transparent disclosure and equitable access to language tools.

Abstract

For decades, the dominance of English has created a substantial barrier in global science, disadvantaging non-native speakers. The recent rise of generative AI (GenAI) offers a potential technological response to this long-standing inequity. We provide the first large-scale evidence testing whether GenAI acts as a linguistic equalizer in global science. Drawing on 5.65 million scientific articles published from 2021 to 2024, we compare GenAI-assisted and non-assisted publications from authors in non-English-speaking countries. Using text embeddings derived from a pretrained large language model (SciBERT), we measure each publication's linguistic similarity to a benchmark of scientific writing from U.S.-based authors and track stylistic convergence over time. We find significant and growing convergence for GenAI-assisted publications after the release of ChatGPT in late 2022. The effect is strongest for domestic coauthor teams from countries linguistically distant from English. These findings provide large-scale evidence that GenAI is beginning to reshape global science communication by reducing language barriers in research.
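
The similarity measure described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that similarity is the cosine between a publication's embedding and the mean embedding of the U.S. benchmark; the abstract does not specify how the benchmark is aggregated. The function names are hypothetical, and the toy 3-dimensional vectors stand in for real SciBERT embeddings.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the assumed benchmark centroid)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_to_benchmark(doc_embedding, benchmark_embeddings):
    """Similarity of one document to the centroid of the benchmark set (assumption)."""
    return cosine_similarity(doc_embedding, mean_vector(benchmark_embeddings))


# Toy vectors, not real embeddings: two benchmark texts and two test papers.
us_benchmark = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
paper_close = [0.9, 0.1, 0.0]   # points in the same direction as the centroid
paper_far = [0.0, 0.0, 1.0]     # orthogonal to the centroid

score_close = similarity_to_benchmark(paper_close, us_benchmark)
score_far = similarity_to_benchmark(paper_far, us_benchmark)
```

In this toy setup `score_close` is near 1 and `score_far` is 0, so a rising score over time would correspond to the stylistic convergence the abstract reports.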

Paper Structure

This paper contains 3 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Global Landscape of GenAI-assisted Publications. This map shows the share of publications in 2024 identified as GenAI-assisted by a keyword-based screening of titles and abstracts relative to all publications of a country. Only countries with at least 100 publications in 2024 are included to remove small-sample noise. Color shading indicates shares expressed in percentage terms.
  • Figure 2: Linguistic Distance from English and GenAI Adoption. This figure shows the relationship between linguistic proximity to U.S. English (x-axis, where 0 denotes maximum distance and 1 denotes identity) and the share of GenAI-assisted publications in 2024 (y-axis, expressed in percentage terms). Panels show all scientific domains combined and each scientific domain separately. Countries with more than 100 publications in 2024 are included; only major countries are explicitly labeled. The consistent negative slopes across panels indicate that GenAI adoption is higher in countries linguistically more distant from English, supporting the linguistic equalizer hypothesis. Regression coefficients ($\beta$), adjusted $R^2$ values, and p-values are reported in each panel. Significance: *** $p{<}0.001$, ** $p{<}0.01$, * $p{<}0.05$, $\cdot$ $p{<}0.10$.
  • Figure 3: Overall and Scientific Field Linguistic Similarity to U.S. Scientific Writing. This figure shows average linguistic similarity of publications from non-English-speaking countries to U.S. scientific writing, measured using SciBERT embeddings, and shown by publication year and aggregated scientific domain. Dashed lines denote values when publications in medicine are excluded. The upward shifts after 2022, especially in Engineering & Technology and Physical Sciences, indicate growing linguistic convergence consistent with GenAI diffusion.
  • Figure 4: Linguistic Similarity in Publications: Overall and by Scientific Field. This figure shows estimated effects based on the event-study regression described in Materials and Methods, comparing GenAI-assisted and non-GenAI-assisted publications from non-English-speaking countries. The dependent variable is the SciBERT-based linguistic similarity to U.S. scientific writing. Coefficients represent year-specific differences in linguistic similarity relative to the 2022 baseline; positive values indicate that GenAI-assisted texts converge more strongly toward U.S. writing than non-GenAI-assisted texts. All models include country, field, journal, year, and journal-year fixed effects, with heteroskedasticity-robust standard errors clustered at the journal level.
  • Figure 5: Linguistic Similarity Convergence: Subsample Estimations. This figure shows estimated effects based on the event-study regression described in Materials and Methods, comparing GenAI-assisted to non-GenAI-assisted publications from non-English-speaking countries. Arrows indicate the sample split within each panel. Panel (a) contrasts domestically coauthored papers (all authors from the same country) with internationally coauthored papers ($\geq$1 author from another country). Panel (b) contrasts countries linguistically close to English versus linguistically distant within domestically coauthored papers. Panel (c) contrasts coauthor teams with versus without a coauthor from an English-speaking country within internationally coauthored papers. Panel (d) contrasts publications in high-impact versus low-impact journals. The dependent variable is the SciBERT-based linguistic similarity to U.S. scientific writing. Coefficients are year-specific differences relative to 2022; positive values indicate that GenAI-assisted texts converge more toward U.S. writing than non-GenAI-assisted texts. The shaded “ChatGPT era” (2023–2024) highlights the post-introduction period. All models include country, field, journal, year and journal-year fixed effects, with heteroskedasticity-robust standard errors clustered at the journal level.
  • ...and 4 more figures
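
The captions for Figures 4 and 5 describe coefficients as year-specific differences between GenAI-assisted and non-assisted publications, relative to the 2022 baseline. The sketch below shows that reading in its simplest form: for each year it takes the gap in mean similarity between the two groups and subtracts the 2022 gap. It is a deliberate simplification and not the paper's estimator, since it omits the country, field, journal, year, and journal-year fixed effects and the clustered standard errors. The function name and all numbers are made up for illustration.

```python
from collections import defaultdict


def event_study_gaps(records, base_year=2022):
    """Return {year: (assisted - non-assisted gap) minus the base-year gap}.

    records: iterable of (year, genai_assisted, similarity) tuples.
    Simplified illustration: raw group means only, no fixed effects.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, assisted, similarity in records:
        key = (year, bool(assisted))
        totals[key] += similarity
        counts[key] += 1

    def raw_gap(year):
        # Both groups must be observed in a year for the gap to be defined.
        for flag in (True, False):
            if counts.get((year, flag), 0) == 0:
                raise ValueError(f"year {year} lacks a group with assisted={flag}")
        treated = totals[(year, True)] / counts[(year, True)]
        control = totals[(year, False)] / counts[(year, False)]
        return treated - control

    years = sorted({year for year, _ in counts})
    base = raw_gap(base_year)
    return {year: raw_gap(year) - base for year in years}


# Made-up similarity scores, one observation per year and group.
toy_records = [
    (2021, True, 0.50), (2021, False, 0.50),
    (2022, True, 0.52), (2022, False, 0.51),
    (2023, True, 0.58), (2023, False, 0.52),
    (2024, True, 0.62), (2024, False, 0.53),
]
gaps = event_study_gaps(toy_records)
```

With these toy numbers the 2022 entry is zero by construction and the 2023 and 2024 entries are positive, which is the pattern the captions interpret as GenAI-assisted texts converging more strongly toward U.S. writing than non-assisted texts.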