Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition
Qiuming Luo, Tao Zeng, Feng Li, Heming Liu, Rui Mao, Chang Kong
TL;DR
The paper tackles zero-shot handwritten Chinese character recognition by addressing two gaps: radicals differ in discriminative value, and cross-modal alignment is coarse. It introduces the Entropy-Aware Structural Alignment Network, which combines an entropy-guided position embedding, Dual-View Radical Trees for multi-granularity topology, and a Top-K semantic feature fusion mechanism paired with cross-modal attention to align visual features with semantic prototypes. The approach achieves state-of-the-art zero-shot performance on standard benchmarks, shows strong few-shot data efficiency, and keeps inference fast by using offline precomputed representations. The framework improves recognition of unseen characters from limited data, scales to large-vocabulary ideographic scripts, and could potentially be extended to end-to-end text-line recognition.
Abstract
Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative roots over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples.
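
The abstract names three mechanisms without giving formulas, so the following is only a minimal sketch of how each could look, not the authors' implementation. It assumes that the entropy prior is the self-information of a radical computed from how many characters contain it, that the gate is an element-wise sigmoid blend, and that semantic neighbors are ranked by cosine similarity and averaged without weights. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def radical_information(char_to_radicals):
    """Self-information -log2 p(r), where p(r) is the fraction of
    characters that contain radical r (assumed form of the prior)."""
    n_chars = len(char_to_radicals)
    counts = Counter(
        r for radicals in char_to_radicals.values() for r in set(radicals)
    )
    return {r: -math.log2(c / n_chars) for r, c in counts.items()}


def modulate(pos_embedding, weight):
    """Multiplicative interaction: scale a positional embedding by a prior."""
    return [weight * x for x in pos_embedding]


def sigmoid_gate(global_feat, local_feat, gate_logits):
    """Blend two feature views element-wise with a sigmoid gate."""
    fused = []
    for g, l, z in zip(global_feat, local_feat, gate_logits):
        a = 1.0 / (1.0 + math.exp(-z))
        fused.append(a * g + (1.0 - a) * l)
    return fused


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topk_centroid(query, prototypes, k):
    """Return the k prototypes most similar to the query and their mean."""
    ranked = sorted(
        prototypes, key=lambda name: cosine(query, prototypes[name]), reverse=True
    )[:k]
    centroid = [
        sum(prototypes[name][i] for name in ranked) / len(ranked)
        for i in range(len(query))
    ]
    return ranked, centroid


# Toy vocabulary: the radical 日 appears in three of four characters,
# so it carries less information than the rarer 寸 or 青.
toy_vocab = {
    "明": ["日", "月"],
    "时": ["日", "寸"],
    "晴": ["日", "青"],
    "朋": ["月", "月"],
}
info = radical_information(toy_vocab)
weighted = modulate([0.5, -0.5], info["寸"])

# Toy prototypes in a 2-D semantic space.
toy_prototypes = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [0.0, 1.0]}
neighbors, centroid = topk_centroid([1.0, 0.0], toy_prototypes, k=2)
blended = sigmoid_gate([2.0], [0.0], [0.0])
```

In this toy setup the common radical 日 receives a smaller weight than 寸, which is the "prioritize discriminative over ubiquitous components" behavior the abstract describes. How the centroid is then combined with the visual feature during decoding is not specified in the abstract and is left out here.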
