OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?

Zijian Chen, Tingzhu Chen, Wenjun Zhang, Guangtao Zhai

TL;DR

OBI-Bench is a multi-task benchmark that evaluates large multi-modal models (LMMs) across the whole process of oracle bone inscription (OBI) study: recognition, rejoining, classification, retrieval, and deciphering. It comprises 5,523 images from diverse sources. By introducing the O2BR and OBI-rejoin datasets and task-oriented prompts (What, Yes-or-No, How, Where), the paper systematically probes LMMs’ visual perception and domain cognition, finding strong performance in deciphering and coarse perception but substantial gaps in fine-grained localization and knowledge-specific tasks. The experiments cover 23 models (6 proprietary, 17 open-source) and show that top-tier models approach human-level ability in some areas, while open-source LMMs lag significantly, and that continued domain-specific tuning and preprocessing are needed. The work argues that LMMs can meaningfully aid ancient language research and outlines directions for robust, trustworthy, domain-adapted multi-modal systems for OBI study.

Abstract

We introduce OBI-Bench, a holistic benchmark crafted to systematically evaluate large multi-modal models (LMMs) on whole-process oracle bone inscriptions (OBI) processing tasks demanding expert-level domain knowledge and deliberate cognition. OBI-Bench includes 5,523 meticulously collected diverse-sourced images, covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. These images span centuries of archaeological findings and years of research by front-line scholars, comprising multi-stage font appearances from excavation to synthesis, such as original oracle bone, inked rubbings, oracle bone fragments, cropped single characters, and handprinted characters. Unlike existing benchmarks, OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs to perform tasks akin to those faced by experts. The evaluation of 6 proprietary LMMs as well as 17 open-source LMMs highlights the substantial challenges and demands posed by OBI-Bench. Even the latest versions of GPT-4o, Gemini 1.5 Pro, and Qwen-VL-Max are still far from public-level humans in some fine-grained perception tasks. However, they perform at a level comparable to untrained humans in deciphering tasks, indicating remarkable capabilities in offering new interpretative perspectives and generating creative guesses. We hope OBI-Bench can facilitate the community to develop domain-specific multi-modal foundation models towards ancient language research and delve deeper to discover and enhance these untapped potentials of LMMs.

Paper Structure

This paper contains 36 sections, 1 equation, 18 figures, 13 tables.

Figures (18)

  • Figure 1: Overview of the OBI-Bench. OBI-Bench presents five in-process tasks: 1) recognition: locating dense oracle bone characters from original oracle bone or rubbings; 2) rejoining: reconstructing fragmented text fragments into coherent texts; 3) classification: categorizing individual characters into their respective meanings; 4) retrieval: returning relevant results according to the given query OBI images; 5) deciphering: interpreting the OBI for historical and cultural investigation.
  • Figure 2: Sampled OBI-Bench examples from each task. 5,523 (I, Q, A) tuples span two quadrants of OBI concerns and encompass four types of questions, providing an all-around evaluation of the ability of LMMs on OBI tasks. Note that OBI classification and retrieval tasks share the same queries.
  • Figure 3: Effects of the number of character categories on classification accuracy.
  • Figure 4: Qualitative comparison of deciphering results between two state-of-the-art LMMs, i.e., GPT-4o and Qwen-VL-Max, in a single round of direct questioning. It is noted that neither GPT-4o nor Qwen-VL-Max has fully deciphered these four oracle bone characters.
  • Figure 5: Interface of subjective experiments for the recognition task.
  • ...and 13 more figures
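
As a rough illustration of the (I, Q, A) tuples described in the Figure 2 caption, the sketch below models one benchmark item as an image, a question, and a reference answer, tagged with one of the five tasks and one of the four question types. The class and function names (`OBISample`, `accuracy_by_task`) and the exact-match scoring are assumptions made for this example, not the authors' data format or evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """The five OBI-Bench tasks named in the abstract."""
    RECOGNITION = "recognition"
    REJOINING = "rejoining"
    CLASSIFICATION = "classification"
    RETRIEVAL = "retrieval"
    DECIPHERING = "deciphering"


class QuestionType(Enum):
    """The four prompt types named in the TL;DR."""
    WHAT = "What"
    YES_OR_NO = "Yes-or-No"
    HOW = "How"
    WHERE = "Where"


@dataclass(frozen=True)
class OBISample:
    """Hypothetical container for one (I, Q, A) tuple."""
    image: str      # I: path or identifier of the image (placeholder)
    question: str   # Q: the prompt posed to the model
    answer: str     # A: the reference answer
    task: Task
    question_type: QuestionType


def accuracy_by_task(samples, predictions):
    """Case-insensitive exact-match accuracy per task.

    This metric is an assumption for illustration only; the paper
    may score each task differently.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        total[sample.task] += 1
        if pred.strip().lower() == sample.answer.strip().lower():
            correct[sample.task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Grouping scores by `Task` (or, analogously, by `QuestionType`) mirrors the way the benchmark reports results per task rather than as a single aggregate number.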