"Write in English, Nobody Understands Your Language Here": A Study of Non-English Trends in Open-Source Repositories

Masudul Hasan Masud Bhuiyan, Manish Kumar Bala Kumar, Cristian-Alexandru Staicu

TL;DR

This study investigates the extent to which OSS is becoming more multilingual by analyzing 9.14 billion GitHub issues, pull requests, and discussions, and 62,500 repositories across five programming languages and 30 natural languages, covering the period from 2015 to 2025.

Abstract

The open-source software (OSS) community has historically been dominated by English as the primary language for code, documentation, and developer interactions. However, with growing global participation and better support for non-Latin scripts through standards like Unicode, OSS is gradually becoming more multilingual. This study investigates the extent to which OSS is becoming more multilingual, analyzing 9.14 billion GitHub issues, pull requests, and discussions, and 62,500 repositories across five programming languages and 30 natural languages, covering the period from 2015 to 2025. We examine six research questions to track changes in language use across communication, code, and documentation. We find that multilingual participation has steadily increased, especially in Korean, Chinese, and Russian. This growth appears not only in issues and discussions but also in code comments, string literals, and documentation files. While this shift reflects greater inclusivity and language diversity in OSS, it also creates language tension. The ability to express oneself in a native language can clash with shared norms around English use, especially in collaborative settings. Non-English or multilingual projects tend to receive less visibility and participation, suggesting that language remains both a resource and a barrier, shaping who gets heard, who contributes, and how open collaboration unfolds.
Paper Structure (13 sections, 13 figures, 1 table)

Figures (13)

  • Figure 1: Overview of our methodology: data collection, language detection, and analysis across code and discussions.
  • Figure 2: Evaluation of language detection models: (a) ROC curve and threshold selection for Lingua; (b) performance comparison with Google Translator on short code texts.
  • Figure 3: Monthly percentage of non-English content in GitHub discussions (issues, pull requests) from 2015 to 2025, showing a consistent upward trend and increased multilingual participation in open-source development.
  • Figure 4: Monthly distribution of non-English natural languages in GitHub discussions from 2015 to 2025, highlighting the sustained presence of Chinese, Japanese, Russian, Korean, and Vietnamese, and the growing participation in Korean, Persian, and Vietnamese over time.
  • Figure 5: Share of non-English content in code elements from 2015–2025, with comments and string literals showing the most growth.
  • ...and 8 more figures