
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP

Danlu Chen, Freda Shi, Aditi Agarwal, Jacobo Myerston, Taylor Berg-Kirkpatrick

TL;DR

LogogramNLP introduces a benchmark for NLP on ancient logographic languages by pairing visual and textual representations across four writing systems (Linear A, Akkadian, Ancient Egyptian, Bamboo Script) and evaluating translation, parsing, and attribute classification. The study experiments with diverse feature encodings, including vocabulary extension, Latin transliteration proxies, tokenization-free methods, pixel-based text encoders (PIXEL), and full-document image encoding, combined with task-specific layers for MT, classification, and parsing. Key findings show that visual representations can outperform text-based ones on certain tasks (notably translation) and that visual encoders often transfer better from cross-lingual pretraining, while text-based approaches remain strong for some attribute classification signals. The results underscore the potential to unlock vast cultural heritage data for NLP analyses and point to continued improvements in OCR quality, broader language coverage, and standardized representations as critical future directions.

Abstract

Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor-intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription. This poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.

Paper Structure (66 sections, 9 figures, 7 tables)

This paper contains 66 sections, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Illustration of the processing flow of Old Chinese (in Bamboo Script), an ancient logographic language, best viewed in color. M denotes the pre-trained model used in the pipeline. Vision-based models directly process visual representations (violet; dashed lines). Conventional NLP pipelines (blue; solid lines) first convert visual representations into symbolic text, either automatically, which is quite noisy, or manually, which is labor-intensive. However, as shown, some ancient logographic writing systems have symbol inventories that have not yet been fully mapped into Unicode. Even when Unicode codepoints exist, they are often disjoint from the symbol inventories of high-resource languages, reducing the effectiveness of transferring from pre-trained models. Finally, latinization (a potential solution for finding common ground with pre-training languages) loses information from the original input, is not fully standardized, and is difficult to automate.
  • Figure 2: Examples of four logographic languages with different representation formats. The arrow shows the typical processing flow of ancient languages by humanists. The workload and expertise required to transcribe the text from images is even greater than that of downstream tasks such as machine translation. The red circle O (in Bamboo Script) indicates that the character is not yet digitized as Unicode. Green dashed boxes indicate that Unicode exists for Egyptian hieroglyphics and Linear A, but the alignment to documents is unavailable; the same goes for Egyptian and Linear A photographs.
  • Figure 3: Image features of four ancient writing systems. (1) Egyptian hieroglyphs and Bamboo script are already manually segmented into images of lines. In the handcopy version of the Bamboo script, the word within parentheses indicates the corresponding modern Chinese glyph. Although both the Egyptian and Bamboo script images appear to be in a digital font, they are only accessible as images without underlying codepoint mappings to Unicode. (2) Linear A tablets are believed to be written in horizontal lines running from left to right (Salgarella, 2020); therefore, we use the montage concatenation of each glyph as the representation. (3) We digitally render Cuneiform Unicode using a computer font as the visual representation.
  • Figure 4: Case study for machine translation using the PIXEL-MT model. Notably, there are many spelling errors in the predictions, particularly with uncommon named entities.
  • Figure 5: Glyph classification on Old Chinese (ZHO). Left axis: the error rate of glyph classification. The data point at |G| = 50 shows the classification error calculated using the top 50 most frequent glyphs in the dataset. The purple horizontal line (71.85%) represents the line-level text recognition CER for ZHO, provided for reference. Right axis: the frequency count (orange bars) of each glyph in the dataset. Note that the counts are on a logarithmic scale, illustrating the long-tailed distribution of glyph counts.
  • ...and 4 more figures
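
The two quantities named in the Figure 5 caption, a classification error rate restricted to the |G| most frequent glyphs and a character error rate (CER), can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's evaluation code: the function names are hypothetical, CER is assumed to be the standard total edit distance divided by total reference length, and the top-|G| error is assumed to be computed by filtering examples by gold label (the caption does not say whether models were retrained per |G|).

```python
from collections import Counter


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total edits / total reference characters (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def top_g_error_rate(gold, pred, g):
    """Error rate over examples whose gold glyph is among the g most frequent gold glyphs.

    Assumption: the restriction is applied by filtering examples, not by retraining.
    """
    top = {label for label, _ in Counter(gold).most_common(g)}
    pairs = [(y, p) for y, p in zip(gold, pred) if y in top]
    errors = sum(1 for y, p in pairs if y != p)
    return errors / len(pairs)
```

Sweeping `g` upward in `top_g_error_rate` would yield a curve of the kind the caption describes, with rarer glyphs from the long tail entering the evaluation set as |G| grows.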