Reasoning Over the Glyphs: Evaluation of LLM's Decipherment of Rare Scripts

Yu-Fei Shih, Zheng-Lin Lin, Shu-Kai Hsieh

TL;DR

This paper investigates the decipherment of rare scripts not encoded in Unicode by evaluating LVLMs and LLMs on a specially constructed multimodal puzzle dataset. It introduces a glyph-tokenization scheme and two solving paradigms, the Picture Method (for LVLMs) and the Description Method (for LLMs), to study how visual language tokens can be handled without Unicode encoding. Across non-Unicode and Unicode puzzles, the study finds that Unicode encoding lets models draw partly on pre-trained language knowledge for common languages, but that this is insufficient for low-resource scripts without adequate training data, which highlights fundamental challenges in visual-token reasoning and linguistic interpretation. The work underscores the need for larger, more diverse datasets and more advanced instruction strategies to advance AI-assisted linguistic decipherment and the preservation of obscure writing systems.

Abstract

We explore the capabilities of LVLMs and LLMs in deciphering rare scripts not encoded in Unicode. We introduce a novel approach to construct a multimodal dataset of linguistic puzzles involving such scripts, utilizing a tokenization method for language glyphs. Our methods include the Picture Method for LVLMs and the Description Method for LLMs, enabling these models to tackle these challenges. We conduct experiments using prominent models, GPT-4o, Gemini, and Claude 3.5 Sonnet, on linguistic puzzles. Our findings reveal the strengths and limitations of current AI methods in linguistic decipherment, highlighting the impact of Unicode encoding on model performance and the challenges of modeling visual language tokens through descriptions. Our study advances understanding of AI's potential in linguistic decipherment and underscores the need for further research.

Paper Structure

This paper contains 15 sections and 3 figures.

Figures (3)

  • Figure 1: Example of Tokenization of Glyphs
  • Figure 2: Incorrect geometric interpretations example
  • Figure 3: Example for Token's Description on Meroitic