Towards Real-World Writing Assistance: A Chinese Character Checking Benchmark with Faked and Misspelled Characters

Yinghui Li; Zishan Xu; Shaoshen Chen; Haojing Huang; Yangning Li; Yong Jiang; Zhongli Li; Qingyu Zhou; Hai-Tao Zheng; Ying Shen

TL;DR

This work introduces Visual-C3, the first large visual Chinese Character Checking benchmark that explicitly includes faked and misspelled characters in handwritten text. It combines sentence-level and character-level annotations to support end-to-end detection and correction in real-world scenarios, and evaluates two baseline approaches: an OCR-based pipeline and a CLIP-based retrieval-correction pipeline. Experimental results show the dataset is high-quality yet challenging, with faked characters being particularly difficult to detect and correct, highlighting gaps for future end-to-end, multimodal models. The dataset and baselines aim to accelerate progress in intelligent writing assistance for real-world Chinese text understanding and correction.

Abstract

Writing assistance is an application closely tied to everyday life and a fundamental research area in Natural Language Processing (NLP). Its aim is to improve the correctness and quality of input texts, and character checking, which detects and corrects wrong characters, is a crucial part of it. In real-world settings, where handwriting accounts for the vast majority of writing, the characters that humans get wrong include faked characters (i.e., untrue characters created by writing errors) and misspelled characters (i.e., true characters used incorrectly because of spelling errors). However, existing datasets and related studies focus only on misspelled characters, mainly those caused by phonological or visual confusion, and thereby ignore faked characters, which are more common and more difficult. To address this gap, we present Visual-C$^3$, a human-annotated Visual Chinese Character Checking dataset with faked and misspelled Chinese characters. To the best of our knowledge, Visual-C$^3$ is the first real-world visual dataset and the largest human-crafted dataset for the Chinese character checking scenario. We also propose and evaluate novel baseline methods on Visual-C$^3$. Extensive empirical results and analyses show that Visual-C$^3$ is high-quality yet challenging. The Visual-C$^3$ dataset and the baseline methods will be made publicly available to facilitate further research in the community.
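The distinction between faked and misspelled characters can be illustrated with a small sketch. It is not the dataset's actual format, which this page does not describe: the `CharAnnotation` record, the `error_type` and `corrected_sentence` helpers, and the example sentence are all hypothetical. The only convention borrowed from the paper is the use of "X" as a placeholder for a faked character, as in the Figure 2 caption.

```python
from dataclasses import dataclass
from typing import List, Optional

# "X" stands in for a faked character (a non-existent glyph), following the
# placeholder convention mentioned in the Figure 2 caption.
FAKED = "X"


@dataclass
class CharAnnotation:
    """Hypothetical character-level record; not the dataset's real schema."""
    position: int   # index of the character within the sentence
    written: str    # what appears on the page: a real character, or FAKED
    intended: str   # the character the writer meant to produce


def error_type(ann: CharAnnotation) -> Optional[str]:
    """Classify one character as 'faked', 'misspelled', or None (correct)."""
    if ann.written == FAKED:
        return "faked"        # untrue character created by a writing error
    if ann.written != ann.intended:
        return "misspelled"   # true character used in the wrong place
    return None


def corrected_sentence(annotations: List[CharAnnotation]) -> str:
    """Rebuild the sentence-level correction from character-level records."""
    ordered = sorted(annotations, key=lambda a: a.position)
    return "".join(a.intended for a in ordered)


# Illustrative sentence: one misspelled character (令 for 今) and one faked one.
written = "我令天很高" + FAKED
intended = "我今天很高兴"
annotations = [
    CharAnnotation(i, w, t) for i, (w, t) in enumerate(zip(written, intended))
]
labels = [error_type(a) for a in annotations]
```

Here a detection step would correspond to finding the positions where `labels` is not `None`, and a correction step to recovering `corrected_sentence(annotations)`; a faked character carries no usable text identity, which is one way to see why a text-only checker cannot handle it.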

Paper Structure (35 sections, 2 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related Works
Chinese Spell Checking
OCR Error Correction
The Visual-C$^3$ Dataset
Dataset Construction
Data Collection
Data Preprocessing
Annotation Schema
Annotation Workflow
Dataset Analysis
Dataset Statistics
Dataset Quality
Benchmark Settings
Task Formulation
...and 20 more sections

Figures (6)

Figure 1: Examples of Chinese faked characters (错字) and misspelled characters (别字). Orange and red mark the misspelled character and the faked character, respectively.
Figure 2: Overview of the construction process of Visual-C$^3$. "U" represents the unknown character, and "X" represents the faked character.
Figure 3: Illustration of our designed baseline methods, namely OCR-based method (top) and CLIP-based method (bottom).
Figure 4: Statistics of characters that are mishandled. The total numbers of wrong and correct characters in the test set of Visual-C$^3$ are 2,011 and 79,141, respectively.
Figure 5: Some examples of our designed baselines. A ✓ mark indicates that the output of the corresponding method is correct, and a ✘ mark means that the output of the corresponding method is problematic.
...and 1 more figure
