OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?
Zijian Chen, Tingzhu Chen, Wenjun Zhang, Guangtao Zhai
TL;DR
OBI-Bench is a targeted, multi-task benchmark for evaluating large multi-modal models (LMMs) on whole-process oracle bone inscription (OBI) processing. It spans recognition, rejoining, classification, retrieval, and deciphering, with 5,523 images from diverse sources. By introducing the O2BR and OBI-rejoin datasets and task-oriented prompts (What, Yes-or-No, How, Where), the paper systematically probes LMMs' visual perception and domain cognition, revealing strong performance in deciphering and coarse perception but substantial gaps in fine-grained localization and knowledge-specific tasks. The experiments cover 23 models (6 proprietary, 17 open-source) and show that top-tier models approach human-level ability in some areas, while open-source LMMs lag significantly; further domain-specific tuning and preprocessing are needed. The work argues that LMMs can meaningfully aid ancient language research and outlines directions for robust, trustworthy, domain-adapted multi-modal systems for OBI study.
Abstract
We introduce OBI-Bench, a holistic benchmark crafted to systematically evaluate large multi-modal models (LMMs) on whole-process oracle bone inscription (OBI) processing tasks that demand expert-level domain knowledge and deliberate cognition. OBI-Bench includes 5,523 meticulously collected images from diverse sources, covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. These images span centuries of archaeological findings and years of research by front-line scholars, and comprise multi-stage font appearances from excavation to synthesis, such as original oracle bones, inked rubbings, oracle bone fragments, cropped single characters, and handprinted characters. Unlike existing benchmarks, OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs to perform tasks akin to those faced by experts. The evaluation of 6 proprietary LMMs and 17 open-source LMMs highlights the substantial challenges and demands posed by OBI-Bench. Even the latest versions of GPT-4o, Gemini 1.5 Pro, and Qwen-VL-Max still fall well short of public-level human performance in some fine-grained perception tasks. However, they perform at a level comparable to untrained humans in deciphering tasks, indicating remarkable capabilities in offering new interpretative perspectives and generating creative guesses. We hope OBI-Bench can help the community develop domain-specific multi-modal foundation models for ancient language research and further discover and enhance this untapped potential of LMMs.
