LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Danlu Chen, Freda Shi, Aditi Agarwal, Jacobo Myerston, Taylor Berg-Kirkpatrick
TL;DR
LogogramNLP introduces a benchmark for NLP on ancient logographic languages. It pairs visual and textual representations across four writing systems (Linear A, Akkadian cuneiform, Egyptian hieroglyphs, and Bamboo script) and evaluates translation, parsing, and attribute classification. The study compares several feature encodings: vocabulary extension, Latin transliteration as a proxy, tokenization-free methods, pixel-based text encoders (PIXEL), and full-document image encoding, each combined with task-specific layers for machine translation, classification, and parsing. Visual representations outperform text-based ones on certain tasks, notably translation, and visual encoders often transfer better from cross-lingual pretraining, while text-based approaches remain strong on some attribute classification tasks. The results suggest that visual pipelines could unlock large amounts of cultural heritage data for NLP analysis, and they identify better OCR quality, broader language coverage, and standardized representations as key future directions.
Abstract
Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor-intensive process that requires expert knowledge. At present, a large portion of logographic data persists in purely visual form because it has never been transcribed. This poses a bottleneck for researchers seeking to apply NLP toolkits to ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.
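
To make the contrast between the two input routes concrete, the following is a minimal sketch on toy data. It is not the LogogramNLP pipeline: the function names, the toy vocabulary, and the patch width of 4 pixels are all assumptions chosen for illustration. The text route needs a transcription before it can produce token ids, whereas the visual route only needs the image of the writing, which it cuts into fixed-width patches for an image encoder.

```python
# Illustrative sketch only: toy data and hypothetical names, not the benchmark's code.

PATCH = 4  # assumed patch width in pixels, kept small for readability


def encode_text(symbols, vocab):
    """Text route: map transcribed symbols to integer ids (requires a transcription)."""
    unk = vocab["<unk>"]
    return [vocab.get(s, unk) for s in symbols]


def encode_image(image, patch=PATCH):
    """Visual route: cut a line image (a list of pixel rows) into fixed-width patches."""
    width = len(image[0])
    # Round the width up to a multiple of the patch size and pad with background (0).
    padded_width = -(-width // patch) * patch
    rows = [row + [0] * (padded_width - width) for row in image]
    return [
        [row[x:x + patch] for row in rows]
        for x in range(0, padded_width, patch)
    ]


# Text route: the symbol "C" is missing from the toy vocabulary and maps to <unk>.
vocab = {"<unk>": 0, "A": 1, "B": 2}
token_ids = encode_text(["A", "C", "B"], vocab)

# Visual route: a 4x10 binary "scan" becomes three 4x4 patches after padding.
line_image = [[1] * 10 for _ in range(4)]
patches = encode_image(line_image)
```

In this sketch an unseen symbol collapses to `<unk>` on the text route, while the visual route keeps whatever pixels are on the page; that difference is one intuition for why the two representations are worth comparing.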
