Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts

Yingfa Chen; Chenlong Hu; Cong Feng; Chenyang Song; Shi Yu; Xu Han; Zhiyuan Liu; Maosong Sun

Abstract

This study presents a multi-modal multi-granularity tokenizer designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used in ancient China during the Spring and Autumn and Warring States periods (771-256 BCE). Ancient Chinese scripts have a complex hierarchical structure in which a single character may combine multiple sub-characters. Our tokenizer therefore first applies character detection to locate character boundaries, and then performs recognition at both the character and sub-character levels. To support the academic community, we have also assembled the first large-scale dataset of CBSs, with over 100K annotated character image scans. On a part-of-speech tagging task built on our dataset, our tokenizer yields a 5.5% relative improvement in F1-score over mainstream sub-word tokenizers. Our work not only aids further investigation of this script but also has the potential to advance research on other forms of ancient Chinese scripts.

Paper Structure (33 sections, 3 figures, 5 tables)

Introduction
Related Works
Tokenization
Chinese Tokenization
Deep Learning Applications in Ancient Scripts
Dataset
Chu Bamboo Slips
CHUBS
Data Source
Annotating Sub-Character Components
Sub-Character Component Annotation Scheme
Open Platform
Multi-Modal Multi-Granularity Tokenizer
Sub-Character Recognition
Experimental Details
...and 18 more sections

Figures (3)

Figure 1: Overview of our proposed tokenizer on an example. Each ancient character is mapped to a modern character if possible. Otherwise, the tokenizer rolls back to decomposing the character into sub-character units, potentially containing useful information. One possible deciphering of the text is "At first, action is not simple". The slip shown is the 14th slip in Zhonggong document from the Shanghai Museum Slips.
Figure 2: An example of a CBS material. The slip shown is the 98th slip of the "Wu Ji" from Tsinghua University Slips.
Figure 3: A screenshot of our platform for accessing the dataset and a demo of our tokenizer.
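
The fallback behaviour described in the Figure 1 caption (map each ancient character to a modern character where possible, otherwise decompose it into sub-character units) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Recognition` record, the `tokenize` function, the confidence `threshold` used to decide when a mapping counts as "possible", and the bracket notation for sub-character tokens.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recognition:
    """Hypothetical recognizer output for one detected character image."""
    char: Optional[str]     # best-guess modern character, or None if no guess
    confidence: float       # recognizer confidence in `char`, in [0, 1]
    components: List[str]   # predicted sub-character component labels


def tokenize(recognitions: List[Recognition], threshold: float = 0.5) -> List[str]:
    """Emit a character token when confident, else sub-character tokens.

    The threshold rule is an assumed stand-in for "if possible" in the caption.
    """
    tokens: List[str] = []
    for rec in recognitions:
        if rec.char is not None and rec.confidence >= threshold:
            # Character-level granularity: one token per ancient character.
            tokens.append(rec.char)
        elif rec.components:
            # Sub-character granularity: bracketed so these tokens cannot
            # collide with character-level tokens in the vocabulary.
            tokens.extend(f"[{c}]" for c in rec.components)
        else:
            # Neither level produced anything usable.
            tokens.append("[UNK]")
    return tokens


if __name__ == "__main__":
    slip = [
        Recognition("人", 0.9, ["人"]),
        Recognition("明", 0.2, ["日", "月"]),
        Recognition(None, 0.0, []),
    ]
    print(tokenize(slip))
```

In this sketch the second character is recognized with low confidence, so it is emitted as its two components rather than as a single token; lowering `threshold` shifts the output toward character-level tokens.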
