Improving Chinese Character Representation with Formation Tree

Yang Hong, Yinfei Li, Xiaojun Qiao, Rui Li, Junsong Zhang

TL;DR

This work tackles the challenge of generalizing Chinese character representations, especially for unseen characters, under the long-tail distribution. It introduces Formation Tree-CLIP (FT-CLIP), which represents characters as formation trees and uses a dedicated Formation Tree Transformer (tree encoder) trained with a CLIP-style objective, aided by masking strategies in both the image and tree modalities. Key contributions include the formation-tree representation with $12$ IDS formation types and $26$ azimuths, SubTree and Azimuth encodings in the tree encoder, and image/tree masking to accelerate training and improve accuracy, yielding state-of-the-art results on unseen and handwritten seen-character tasks with a lightweight model. The approach aligns closely with the intrinsic hierarchical structure of Chinese characters, improves generalization to unseen radicals, and reduces computation, enabling faster deployment in practical recognition systems.
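
The CLIP-style objective mentioned above can be sketched as a symmetric contrastive loss over a batch of paired image and tree embeddings. This is a minimal pure-Python illustration, not the authors' implementation: the function name `clip_style_loss` and the temperature value `0.07` are assumptions made for the example.

```python
import math

def clip_style_loss(image_embs, tree_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is one matched pair.

    The name and the default temperature are illustrative assumptions.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    trees = [normalize(v) for v in tree_embs]
    n = len(imgs)

    # logits[i][j]: temperature-scaled cosine similarity of image i and tree j.
    logits = [[sum(a * b for a, b in zip(imgs[i], trees[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # The correct class for row i is column i (the matched pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the image-to-tree and tree-to-image directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each image embedding is closest to its own tree embedding and large when the pairings are swapped, which is the behavior the joint training relies on.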

Abstract

Learning effective representations for Chinese characters presents unique challenges, primarily due to the vast number of characters and their continuous growth, which requires models to handle an expanding category space. Additionally, the inherent sparsity of character usage complicates the generalization of learned representations. Prior research has explored radical-based sequences to overcome these issues, achieving progress in recognizing unseen characters. However, these approaches fail to fully exploit the inherent tree structure of such sequences. To address these limitations and leverage established data properties, we propose Formation Tree-CLIP (FT-CLIP). This model utilizes formation trees to represent characters and incorporates a dedicated tree encoder, significantly improving performance in both seen and unseen character recognition tasks. We further apply masking to both character images and tree nodes, enabling efficient and effective training. This approach accelerates training significantly (by a factor of 2 or more) while enhancing accuracy. Extensive experiments show that processing characters through formation trees aligns better with their inherent properties than direct sequential methods, significantly enhancing the generality and usability of the representations.
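
To make the tree structure of radical-based sequences concrete, the sketch below parses a prefix-notation ideographic description sequence (IDS) into a tree and labels each child with an azimuth. It assumes that the formation types correspond to the twelve Unicode ideographic description characters U+2FF0 to U+2FFB, and that an azimuth is a (formation type, slot index) pair; the names `Node`, `parse_ids`, `ARITY`, and `ALL_AZIMUTHS` are hypothetical, and the node layout is not necessarily the one used by the authors.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Unicode ideographic description characters U+2FF0..U+2FFB and their arity.
# Assumption: these twelve operators stand in for the paper's formation types.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # left / middle / right
ARITY["\u2FF3"] = 3  # top / middle / bottom

@dataclass
class Node:
    symbol: str  # a formation operator (inner node) or a radical (leaf)
    # (parent formation operator, slot index); None for the root.
    azimuth: Optional[Tuple[str, int]] = None
    children: List["Node"] = field(default_factory=list)

def parse_ids(ids: str) -> Node:
    """Parse a prefix-notation IDS string into a tree of Node objects."""
    def parse(pos: int, azimuth):
        symbol = ids[pos]
        node = Node(symbol, azimuth)
        pos += 1
        # Operators consume as many sub-sequences as their arity; radicals none.
        for slot in range(ARITY.get(symbol, 0)):
            child, pos = parse(pos, (symbol, slot))
            node.children.append(child)
        return node, pos

    root, end = parse(0, None)
    if end != len(ids):
        raise ValueError("trailing symbols after a complete IDS")
    return root

# Every (operator, slot) pair under the assumption above.
ALL_AZIMUTHS = [(op, slot) for op, n in ARITY.items() for slot in range(n)]

root = parse_ids("⿱⿰木目心")  # a top-bottom split whose top part splits left-right
print(root.symbol, [child.symbol for child in root.children])
print(len(ARITY), len(ALL_AZIMUTHS))
```

Under these assumptions, ten binary operators and two ternary operators give 10 × 2 + 2 × 3 = 26 slots, which is consistent with the 12 formation types and 26 azimuths stated in the TL;DR.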

Paper Structure (35 sections, 4 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Illustration of radical-based sequences, which preserve hierarchical knowledge in the form of a decomposition tree. Existing approaches encode them sequentially using a look-ahead mask, while our approach transforms the decomposition tree into a formation tree with fixed edge directions and formation-type-related edge attributes.
  • Figure 2: FT-CLIP jointly trains an image encoder and a tree encoder to predict the correct pairings in a batch of training examples. A ViT with random masking is employed as the image encoder, and a novel Formation Tree Transformer is proposed as the tree encoder.
  • Figure 3: The twelve formation types used in FT-CLIP are defined in the top row, with corresponding examples illustrated below them. The bottom row shows the azimuths as the red part of each formation type. Azimuth names combine the abbreviation of the formation type with the azimuth's index within that formation.
  • Figure 4: An illustration of the proposed SubTree Encoding and Azimuth Encoding in our tree encoder. SubTree Encoding limits self-attention to a node and its direct children, ignoring all other nodes (marked in grey). The type of edge between a node and each of its children is defined by the azimuth of the child node. Azimuth Encoding uses the azimuth of each node as an additional feature of the node embedding.
  • Figure 5: Accuracy vs. training time trade-off by image mask ratio in detail. Four mask ratios are tested in five different cases of the zero-shot character recognition task. Different mask ratios are marked with different colors. $m$ indicates the total number of character categories used as training samples. In each group of experiments, all parameters remain constant except for the mask ratio.
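
The attention restriction described in the Figure 4 caption (a node attends only to itself and its direct children) can be sketched as a boolean mask built from a parent array. This is an illustrative reading of the caption, not the authors' code: the function name `subtree_attention_mask` and the parent-array encoding of the tree are assumptions.

```python
from typing import List, Optional

def subtree_attention_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """Return mask[q][k] == True where query node q may attend to key node k.

    parents[i] is the index of node i's parent, or None for the root.
    A node may attend to itself and to its direct children only.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for node in range(n):
        mask[node][node] = True        # each node sees itself
    for child, parent in enumerate(parents):
        if parent is not None:
            mask[parent][child] = True  # a parent sees each direct child
    return mask

# Pre-order nodes of a five-node tree: a root with two children (nodes 1 and 4),
# where node 1 has two children of its own (nodes 2 and 3).
parents = [None, 0, 1, 1, 0]
for row in subtree_attention_mask(parents):
    print("".join("1" if allowed else "." for allowed in row))
```

In a softmax attention layer, positions where the mask is False would typically have their scores set to negative infinity before normalization, so grandchildren and unrelated nodes contribute nothing to a node's update.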