OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research

Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, Xu Peng, Taisong Jin, Yongge Liu, Shengwei Han, Jing Yang, Xiaoping He, Feng Gao, AndyPian Wu, SevenShu, Chaoyang Wang, Chengjie Wang

TL;DR

OracleAgent tackles two core challenges in Oracle Bone Script (OBS) research: the complexity of multi-step interpretation workflows and the fragmentation of OBS information retrieval. It combines an LLM-powered Brain with a suite of domain-specific, model-driven tools and a rigorously curated multimodal knowledge base (over 1.4M single-character rubbing images and 80K interpretation texts) to enable end-to-end analysis, retrieval, and facsimile generation for OBS materials. Across detection, retrieval, classification, and generation tasks, OracleAgent achieves state-of-the-art multimodal reasoning and generation performance, and it cuts the time researchers spend by automating evidence gathering and synthesis. This work advances practical OBS-assisted research and lays a foundation for scalable digitization and computational humanities of ancient scripts.

Abstract

As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieving characters, documents, interpretation texts, and rubbing images. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) such as GPT-4o. Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.
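
The orchestration idea in the abstract (a Brain that chains domain tools, some of which query a knowledge base) can be pictured with a minimal sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: the names `Brain`, `classify_rubbing`, `retrieve_catalogues`, and the toy `KNOWLEDGE_BASE` are assumptions, and the fixed plan table replaces what would be LLM-driven planning in the real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for the knowledge base: character label -> catalogues.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "char_A": ["catalogue_1", "catalogue_2"],
}


def classify_rubbing(image_id: str) -> str:
    """Toy classification tool; a real system would run a vision model here."""
    return "char_A"


def retrieve_catalogues(character: str) -> List[str]:
    """Toy retrieval tool: look up which catalogues record a character."""
    return KNOWLEDGE_BASE.get(character, [])


@dataclass
class Brain:
    """Chains tools for a given intent and records each step in memory."""
    tools: Dict[str, Callable]
    memory: List[dict] = field(default_factory=list)

    def plan(self, intent: str) -> List[str]:
        # Assumption: a fixed lookup table instead of LLM-based planning.
        plans = {"analyze_rubbing": ["classify_rubbing", "retrieve_catalogues"]}
        return plans.get(intent, [])

    def run(self, intent: str, user_input: str):
        result = user_input
        for name in self.plan(intent):
            result = self.tools[name](result)  # one tool's output feeds the next
            self.memory.append({"tool": name, "output": result})
        return result


brain = Brain(tools={
    "classify_rubbing": classify_rubbing,
    "retrieve_catalogues": retrieve_catalogues,
})
answer = brain.run("analyze_rubbing", "rubbing_001.png")
print(answer)
```

The sketch covers only a serial chain of tools. The parallel sub-tasks mentioned in the abstract would need a richer plan representation than a flat list.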

Paper Structure

This paper contains 30 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustrative examples of oracle bone images from different OBS data modalities.
  • Figure 2: Architecture overview of the proposed OracleAgent. OracleAgent consists of four modules: Perception, Brain, Tools, and Knowledge Bases. The Perception module accepts multimodal user inputs and infers user intent. The Brain stores states in Memory and integrates multimodal reasoning with tool-based decision-making. Some of the tools integrated within OracleAgent can retrieve information from the knowledge base.
  • Figure 3: OracleAgent Interaction Flow: Automated analysis of an oracle bone rubbing.
  • Figure 4: OracleAgent Interaction Flow: Follow-up query about which catalogues record this oracle bone character from the last response.
  • Figure 5: Examples of facsimile image generation. Note that the input image is a rubbing image and OracleAgent generates its facsimile form.
  • ...and 1 more figure