Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition

Qiuming Luo, Tao Zeng, Feng Li, Heming Liu, Rui Mao, Chang Kong

TL;DR

The paper tackles zero-shot handwritten Chinese character recognition by addressing two gaps: radicals differ widely in discriminative value, and existing cross-modal alignment is coarse. It introduces the Entropy-Aware Structural Alignment Network, which combines an entropy-guided position embedding, Dual-View Radical Trees for multi-granularity topology, and a Top-K semantic feature fusion mechanism paired with cross-modal attention to align visual features with semantic prototypes. The approach reports state-of-the-art zero-shot performance on standard benchmarks, strong few-shot data efficiency, and high inference speed thanks to offline precomputed representations. The framework targets robust recognition of unseen characters from limited data, scales to large-vocabulary ideographic scripts, and could extend to end-to-end text-line recognition.
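To make the entropy-guided position embedding concrete, here is a minimal sketch. It assumes that a radical's information content is its self-information, -log2(p), estimated from occurrence counts, and that the modulation is a plain element-wise scaling of the position embedding. The summary does not give the exact formula, so the function names, the normalization by the maximum value, and the counts below are all illustrative assumptions.

```python
import math

def radical_information(radical_counts):
    """Self-information -log2(p) of each radical, estimated from counts."""
    total = sum(radical_counts.values())
    return {r: -math.log2(c / total) for r, c in radical_counts.items()}

def modulate_positions(radicals, pos_embed, info):
    """Scale each position embedding by its radical's normalized information.

    Assumption: weights are normalized by the largest information value,
    so the rarest radical keeps its embedding unchanged (weight 1.0).
    """
    max_info = max(info.values())
    out = []
    for i, r in enumerate(radicals):
        weight = info[r] / max_info
        out.append([weight * x for x in pos_embed[i]])
    return out

# Made-up counts for illustration only (not statistics from the paper).
counts = {"氵": 512, "木": 448, "冖": 64}
info = radical_information(counts)

# Two positions with identical embeddings, to isolate the effect of the weight.
pos = [[1.0, 1.0], [1.0, 1.0]]
modulated = modulate_positions(["氵", "冖"], pos, info)
```

With these counts the frequent radical carries 1 bit and the rare one 4 bits, so the frequent radical's position embedding is scaled down to a quarter while the rare radical's embedding passes through unchanged.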

Abstract

Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative roots over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples.
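The Top-K Semantic Feature Fusion step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: neighbors are ranked by cosine similarity, the centroid is an unweighted mean, and the centroid is added to the query with a hypothetical mixing weight `alpha`. The abstract only states that the centroid of semantic neighbors augments decoding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_centroid(query, prototypes, k):
    """Mean of the k prototypes most similar to the query."""
    ranked = sorted(prototypes, key=lambda p: cosine(query, p), reverse=True)
    neighbors = ranked[:k]
    return [sum(p[d] for p in neighbors) / len(neighbors)
            for d in range(len(query))]

def fuse(query, prototypes, k=2, alpha=0.5):
    """Augment the query with the neighbor centroid (additive mixing assumed)."""
    centroid = top_k_centroid(query, prototypes, k)
    return [q + alpha * c for q, c in zip(query, centroid)]

# Toy semantic prototypes and a toy visual query.
prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 1.0]
fused = fuse(query, prototypes, k=2, alpha=0.5)
```

Here the two nearest prototypes are `[1.0, 1.0]` and `[1.0, 0.0]`, so the centroid is `[1.0, 0.5]` and the fused query is `[2.5, 1.25]`. Averaging over several neighbors, rather than taking only the single best match, is what lets the consensus dampen the effect of one misleading neighbor.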

Paper Structure (38 sections, 12 equations, 8 figures, 5 tables)

This paper contains 38 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The overall architecture of the proposed Entropy-Aware Structural Alignment Network. The framework consists of three input branches and a central matching mechanism: (1) The Visual Branch (top-left) employs a ResNet-based backbone to extract feature maps from handwritten character images. (2) The Radical Image Branch (middle-left) utilizes our Multimodal Radical Encoder to extract visual-aligned radical features. (3) The IDS Branch (bottom-left) constructs semantic representations using Entropy-Aware Multiplicative Modulation and Dual-View Radical Tree Embeddings. The core component (highlighted in blue) is the Radical Semantic Matching Module, which aligns visual features with five distinct structural representations via an Adaptive Sigmoid-GateFusion. Furthermore, a Top-K Semantic Feature Fusion strategy is employed to augment the decoder's query with robust semantic priors. Finally, the Transformer Decoder (right) processes the enhanced features to generate the final recognition result.
  • Figure 2: Illustration of the proposed Multi-grid 2D Elastic Deformation. Unlike traditional 1D strip-based methods, our approach constructs a dense 2D elastic mesh over the radical image. As visualized, the control points $p_{m,n}$ (red dots) are independently perturbed in both horizontal and vertical directions based on Gaussian random sampling. This mechanism explicitly models realistic non-rigid variations, such as local stroke squeezing and perspective warping, thereby enriching the structural diversity of the training samples.
  • Figure 3: Visualization of the Information Entropy Statistics for over 400 radicals in the vocabulary. Each rectangle represents a unique radical, and the color spectrum indicates its information content: deep red ($\text{Entropy} \approx 9$) corresponds to rare radicals with high information density, while deep blue ($\text{Entropy} \approx 2$) corresponds to frequently occurring radicals with low information content. The dominance of red hues ($\text{Entropy} \in [6, 9]$) indicates that the majority of radicals possess significant discriminative power. This unbalanced distribution serves as the statistical basis for our Entropy-Aware Position Embedding.
  • Figure 4: Illustration of the Dual-View Radical Tree Modeling using the character "深" (deep). (Left) Parent-centric Global View: This view models the bottom-up information aggregation. The root node '⿰' (left-right structure) aggregates semantic features from its left child '氵' and right child (top of '穴' + bottom of '木'), thereby capturing the global composition of the entire character. (Right) Child-centric Local View: This view focuses on local structural roles. Taking the node '木' (wood) as an example, it explicitly attends to its direct parent node '$\boxminus$' (above-below structure) and identifies itself as the bottom component ('footer'), capturing fine-grained local topological information.
  • Figure 5: Architecture of the Adaptive GateFusion Network. The module takes four heterogeneous structural embeddings ($\mathbf{V}_{ent}, \mathbf{F}_{depth}, \mathbf{F}_{parent}, \mathbf{F}_{child}$) as input. Each branch undergoes a linear alignment followed by a Sigmoid-based gating mechanism to determine its element-wise contribution. The gated features are aggregated via summation, and the radical content embedding ($\mathbf{F}_{code}$) is finally injected to preserve the fundamental semantic identity. The output $\mathbf{P}_{sem}$ represents the fused semantic prototypes used for character recognition. This design allows the model to dynamically prioritize critical structural views while maintaining robust radical recognition.
  • ...and 3 more figures
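The gating scheme described in the Figure 5 caption can be sketched in a few lines. This is a simplified illustration under assumptions: each branch is aligned by a bias-free linear map, the gate is a sigmoid of a second linear map applied to the aligned feature, and the content embedding is added after the gated sum. The caption does not specify the weight shapes or the exact gate input, so `w_align` and `w_gate` are hypothetical names.

```python
import math

def linear(weight, x):
    """Matrix-vector product; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_fusion(branches, f_code):
    """Fuse structural embeddings with element-wise sigmoid gates.

    branches: list of (x, w_align, w_gate) tuples, one per structural view.
    f_code:   radical content embedding, injected after the gated sum.
    """
    fused = [0.0] * len(f_code)
    for x, w_align, w_gate in branches:
        h = linear(w_align, x)                        # linear alignment
        g = [sigmoid(z) for z in linear(w_gate, h)]   # gate values in (0, 1)
        fused = [f + gi * hi for f, gi, hi in zip(fused, g, h)]
    return [f + c for f, c in zip(fused, f_code)]     # inject content embedding

# Toy setup: identity alignment and zero gate weights, so every gate is 0.5.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
views = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [2.0, 2.0]]  # stand-ins for V_ent, F_depth, F_parent, F_child
branches = [(v, identity, zeros) for v in views]
p_sem = gate_fusion(branches, f_code=[1.0, -1.0])
```

With all gates fixed at 0.5, the output is half the sum of the four views plus the content embedding, `[4.0, 3.0]`. In a trained model the gate weights would be learned, letting each dimension of each view contribute more or less depending on the input.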