No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Dasol Choi, Woomyoung Park, Youngsook Song

TL;DR

This study addresses the fragmentation of CJK NLP resources by providing the first large-scale, cross-language analysis of Chinese, Japanese, and Korean datasets on Hugging Face. It combines quantitative metadata analysis with qualitative insights to reveal language-specific ecosystem patterns: institutional backing for Chinese datasets, community-driven development for Korean, and an emphasis on subcultural content for Japanese. The authors propose practical strategies for improving dataset documentation, licensing clarity, and cross-lingual resource sharing, and they highlight opportunities for cross-language benchmarks and collaborative curation. The findings aim to guide culturally attuned and more inclusive East Asian LLM development, supported by living catalogs that track how the ecosystem evolves in real time.

Abstract

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, even though these languages together have over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. By uncovering these patterns, we identify practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing, ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

Paper Structure

This paper contains 28 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Datasets of each language sorted by number of downloads in descending order. Based on the decreasing pattern of downloads, we set the cutoff point at 700.
  • Figure 2: Distribution and Composition Analysis of CJK Language Datasets. (a) Illustrates the intersections among CJK language datasets, showing unique and overlapping dataset counts. (b) Shows the composition of the top 700 downloaded datasets for each language, categorized into monolingual, English-paired, and multilingual resources.
  • Figure 3: Task distribution across different languages. The heatmap illustrates the proportion of datasets belonging to the top 7 most frequent task categories across English, Chinese, Japanese, and Korean datasets.
  • Figure 4: License distribution across CJK and English datasets, showing the proportion of Permissive + PublicDomain, Copyleft + NonCommercial/ND, Unknown, and Other licenses for each language community.
  • Figure 5: Instruction Datasets Over Time by Language (English and CJK), from late 2022 to 2024.
  • ...and 2 more figures
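
The selection and grouping described for Figures 1 and 2(b) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`languages`, `downloads`), the function name `composition`, the rule that "English-paired" means exactly the target language plus `en`, and the sample records are all assumptions made for the example. Only the cutoff of 700 and the three category names come from the figure captions.

```python
from collections import Counter

def composition(records, lang, top_n=700):
    """Classify the top_n most-downloaded datasets tagged with `lang`.

    Categories follow Figure 2(b): monolingual, English-paired, multilingual.
    The exact membership rules below are an assumption for illustration.
    """
    # Keep only datasets that list the target language at all.
    tagged = [r for r in records if lang in r["languages"]]
    # Sort by downloads (descending) and apply the cutoff.
    top = sorted(tagged, key=lambda r: r["downloads"], reverse=True)[:top_n]

    counts = Counter()
    for r in top:
        langs = set(r["languages"])
        if langs == {lang}:
            counts["monolingual"] += 1
        elif langs == {lang, "en"}:
            counts["english_paired"] += 1
        else:
            counts["multilingual"] += 1
    return dict(counts)

# Hypothetical metadata records, not real Hub data.
records = [
    {"id": "a", "languages": ["ko"], "downloads": 900},
    {"id": "b", "languages": ["ko", "en"], "downloads": 500},
    {"id": "c", "languages": ["ko", "ja", "zh"], "downloads": 300},
    {"id": "d", "languages": ["ja"], "downloads": 1200},
    {"id": "e", "languages": ["ko"], "downloads": 10},
]

result = composition(records, "ko", top_n=3)
print(result)
```

With real metadata, `records` would be built from each dataset's language tags and download count, and `top_n` would stay at the paper's cutoff of 700.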