Reasoning Over the Glyphs: Evaluation of LLM's Decipherment of Rare Scripts
Yu-Fei Shih, Zheng-Lin Lin, Shu-Kai Hsieh
TL;DR
This paper investigates the decipherment of rare scripts not encoded in Unicode by evaluating LVLMs and LLMs on a specially constructed multimodal puzzle dataset. It introduces a glyph-tokenization scheme and two solving paradigms, the Picture Method (for LVLMs) and the Description Method (for LLMs), to study how visual language tokens can be handled without Unicode encoding. Across non-Unicode and Unicode puzzles, the study finds that Unicode encoding lets models draw partially on pre-trained language knowledge for common languages, but that this is insufficient for low-resource scripts lacking adequate training data, highlighting fundamental challenges in visual-token reasoning and linguistic interpretation. The work underscores the need for larger, more diverse datasets and more advanced instruction strategies to advance AI-assisted linguistic decipherment and the preservation of obscure writing systems.
Abstract
We explore the capabilities of large vision-language models (LVLMs) and large language models (LLMs) in deciphering rare scripts not encoded in Unicode. We introduce a novel approach to constructing a multimodal dataset of linguistic puzzles involving such scripts, based on a tokenization method for language glyphs. Our methods include the Picture Method for LVLMs and the Description Method for LLMs, which enable these models to attempt the puzzles. We conduct experiments on linguistic puzzles with three prominent models: GPT-4o, Gemini, and Claude 3.5 Sonnet. Our findings reveal the strengths and limitations of current AI methods in linguistic decipherment, highlighting the impact of Unicode encoding on model performance and the challenges of modeling visual language tokens through descriptions. Our study advances understanding of AI's potential in linguistic decipherment and underscores the need for further research.
