Finding structure in logographic writing with library learning

Guangyuan Jiang, Matthias Hofer, Jiayuan Mao, Lionel Wong, Joshua B. Tenenbaum, Roger P. Levy

TL;DR

The paper asks how combinatorial structure in writing emerges from efficiency-driven biases, and introduces a library-learning framework to reverse-engineer structure in logographic scripts. It treats Chinese characters as stroke sequences and learns reusable abstractions under a minimum description length (MDL) objective, enabling hierarchical decomposition and radical discovery. The learned library recovers most MoE radicals (about 93%) and captures known radical decompositions, while achieving substantial compression (approximately 4.16×) and greatly shortening per-character representations. In a diachronic analysis, the approach reveals a trend toward simplification across historical scripts, with traditional Chinese retaining more systematic structure than the simplified form. The work offers a principled, compression-based perspective on how cognitive representations and cultural evolution shape writing systems, and efficient communication systems more broadly, over millennia.

Abstract

One hallmark of human language is its combinatoriality -- reusing a relatively small inventory of building blocks to create a far larger inventory of increasingly complex structures. In this paper, we explore the idea that combinatoriality in language reflects a human inductive bias toward representational efficiency in symbol systems. We develop a computational framework for discovering structure in a writing system. Built on top of state-of-the-art library learning and program synthesis techniques, our computational framework discovers known linguistic structures in the Chinese writing system and reveals how the system evolves towards simplification under pressures for representational efficiency. We demonstrate how a library learning approach, utilizing learned abstractions and compression, may help reveal the fundamental computational principles that underlie the creation of combinatorial structures in human cognition, and offer broader insights into the evolution of efficient communication systems.

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: An overview of our library learning model for writing systems. (A) Parts are frequently reused (marked in the same color) within and across characters in multiple logographic writing systems (e.g., Cuneiform, Chinese). (B) In our library learning model, we represent characters as stroke sequences. Learned library functions identify and represent reused parts (e.g., 木) and relations (e.g., x3, repeating three times), leading to program compression and the discovery of structures. (C) The model scales to study multiple scripts in the Chinese writing system across time, revealing trends and adaptations in the use of radicals and other elements.
  • Figure 2: (Left): Primitives used in the base DSL $\mathcal{L}_{base}$, including 33 stroke primitives and one list symbol. (Middle): Character programs represented in $\mathcal{L}_{base}$ and the library functions learned by the model. (Right): Visualization of the hierarchical decomposition discovered for an example character, 颢, represented as a tree of library functions.
  • Figure 3: Visualization of the aligned MoE radical--library function pairs. The 201 radicals from the MoE radical set are colored in blue; the corresponding library functions are shown in gray at the top left (we omit the fn_ prefix for brevity). Our model discovered most of the expert-defined radicals ($93.0\%$).
  • Figure 4: Changes in quantifiable metrics over time. Left: pictorial complexity (following Han et al., 2022) and program (MDL) complexity $C(\mathcal{W})$ calculated by our model. Right: description length under the base library $\mathrm{DL}_{\mathcal{L}_{base}}(\mathcal{W})$, learned library size $\lvert \mathcal{L}_{*} \rvert$, and compression ratio $\mathrm{DL}_{\mathcal{L}_{base}}(\mathcal{W}) / \mathrm{DL}_{\mathcal{L}_{*}}(\mathcal{W})$ for each of the four scripts (oracle bone, seal, traditional, and simplified).
  • Figure 5: Comparison of the compression ratio between traditional and simplified Chinese. On a larger set of 3,762 aligned characters, the traditional Chinese script is more compressible than simplified Chinese.
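
The compression ratio in Figure 4, $\mathrm{DL}_{\mathcal{L}_{base}}(\mathcal{W}) / \mathrm{DL}_{\mathcal{L}_{*}}(\mathcal{W})$, can be illustrated with a toy sketch. This is not the paper's implementation. It assumes description length is simply the number of tokens in each character program, that the library is handed over rather than learned, and that the stroke labels (`heng`, `shu`, `pie`, `na`) and the function name `fn_0` are hypothetical. Stroke variants inside compound characters are ignored.

```python
from typing import Dict, List, Tuple

Program = List[str]  # a character written as a flat sequence of tokens


def description_length(corpus: Dict[str, Program]) -> int:
    """Toy DL: total number of tokens across all character programs."""
    return sum(len(program) for program in corpus.values())


def rewrite(program: Program, name: str, body: Tuple[str, ...]) -> Program:
    """Replace each non-overlapping occurrence of `body` with the token `name`."""
    out: Program = []
    i, n = 0, len(body)
    while i < len(program):
        if tuple(program[i:i + n]) == body:
            out.append(name)
            i += n
        else:
            out.append(program[i])
            i += 1
    return out


def compress(corpus: Dict[str, Program],
             library: Dict[str, Tuple[str, ...]]) -> Dict[str, Program]:
    """Rewrite every program using every library function, in order."""
    result = {}
    for char, program in corpus.items():
        for name, body in library.items():
            program = rewrite(program, name, body)
        result[char] = program
    return result


# Hypothetical, simplified stroke sequence for the reused part 木.
TREE = ("heng", "shu", "pie", "na")

base_corpus = {
    "木": list(TREE),
    "林": list(TREE) * 2,
    "森": list(TREE) * 3,
}
library = {"fn_0": TREE}  # one abstraction, assumed given rather than learned

compressed = compress(base_corpus, library)
dl_base = description_length(base_corpus)
dl_star = description_length(compressed)
ratio = dl_base / dl_star

print(f"DL under base library:    {dl_base}")
print(f"DL under learned library: {dl_star}")
print(f"library size:             {len(library)}")
print(f"compression ratio:        {ratio:.2f}")
```

In this sketch the cost of storing the library itself is reported separately as the library size, mirroring the three quantities plotted in Figure 4. Whether the paper's $\mathrm{DL}_{\mathcal{L}_{*}}$ also charges for the library is not stated in this section.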