Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition
Chenyu Liu, Jinshui Hu, Baocai Yin, Jia Pan, Bing Yin, Jun Du, Qingfeng Liu
TL;DR
This work tackles online handwritten text recognition (OLHTR) by addressing the trade-off between single-stream simplicity and multimodal feature richness. It introduces Col-OLHTR, a collaborative learning framework that trains trajectory- and image-based streams together, with a dedicated Point-to-Spatial Alignment (P2SA) module that learns image-level spatial cues from trajectory features via 2D rotary position embeddings and Transformer layers. A stop-gradient alignment mechanism and an auxiliary loss synchronize the multimodal representations, while inference remains efficient because it uses only the trajectory stream augmented by P2SA. Empirically, Col-OLHTR achieves state-of-the-art results on multiple OLHTR benchmarks (IAM-OnDB, OnHW-WordsTraj, ICDAR2013-Online), demonstrating robustness across languages and writing styles and offering a practical balance between performance and efficiency.
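The stop-gradient alignment idea can be illustrated with a minimal sketch. It assumes, hypothetically, that the auxiliary loss is a mean squared error between P2SA features and image-stream features, and that the image features are the side treated as constants; the summary above does not state the actual loss form or the direction of the stop-gradient. The function name and the hand-written gradients are for illustration only. In an autodiff framework the same effect would typically come from an operation such as `torch.Tensor.detach` or `jax.lax.stop_gradient`.

```python
from typing import List, Tuple

Vector = List[float]


def alignment_loss_and_grads(p2sa: Vector, image: Vector) -> Tuple[float, Vector, Vector]:
    """Hypothetical MSE alignment loss with the image features held constant.

    Returns (loss, gradient w.r.t. p2sa, gradient w.r.t. image). The image
    gradient is all zeros, which is what a stop-gradient on the target means:
    the alignment term updates only the P2SA side.
    """
    if len(p2sa) != len(image):
        raise ValueError("feature vectors must have the same length")
    n = len(p2sa)
    diffs = [p - v for p, v in zip(p2sa, image)]
    loss = sum(d * d for d in diffs) / n
    grad_p2sa = [2.0 * d / n for d in diffs]  # d(loss)/d(p2sa)
    grad_image = [0.0] * n  # assumed stop-gradient: no signal reaches the image stream
    return loss, grad_p2sa, grad_image
```

Under this assumption the image stream is shaped only by its own recognition loss, while the P2SA features are pulled toward it.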
Abstract
Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications. Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders, combined with a CTC or attention-based recognition decoder. However, these approaches face several drawbacks: 1) single encoders typically focus on either local trajectories or visual regions, lacking the ability to dynamically capture relevant global features in challenging cases; 2) multi-stream encoders, while more comprehensive, suffer from complex structures and increased inference costs. To address these drawbacks, we propose a collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process. Col-OLHTR consists of a trajectory encoder, a Point-to-Spatial Alignment (P2SA) module, and an attention-based decoder. The P2SA module is designed to learn image-level spatial features from trajectory-encoded features and 2D rotary position embeddings. During training, an additional image-stream encoder-decoder is collaboratively trained to provide supervision for the P2SA features. At inference, the extra streams are discarded and only the P2SA module is retained, with its output merged in before the decoder, which simplifies the process while preserving high performance. Extensive experiments on several OLHTR benchmarks show state-of-the-art (SOTA) performance, demonstrating the effectiveness and robustness of our design.
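To make the 2D rotary position embedding concrete, here is a minimal sketch of one common formulation: the feature vector is split in half, the first half is rotated by angles derived from an x coordinate and the second half by angles derived from a y coordinate. The function names, the base of 10000, and the use of raw (x, y) point coordinates as positions are assumptions for illustration; the abstract does not specify how P2SA derives or applies its positions.

```python
import math
from typing import List


def rotate_pairs(vec: List[float], pos: float, base: float = 10000.0) -> List[float]:
    """Apply 1D rotary embedding: rotate each consecutive pair by pos * frequency."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency decreases with pair index
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def rope_2d(vec: List[float], x: float, y: float, base: float = 10000.0) -> List[float]:
    """Rotate the first half of vec by x and the second half by y."""
    if len(vec) % 4 != 0:
        raise ValueError("feature dimension must be divisible by 4")
    half = len(vec) // 2
    return rotate_pairs(vec[:half], x, base) + rotate_pairs(vec[half:], y, base)


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))
```

Because each pair undergoes a pure rotation, the dot product between two embedded vectors depends only on the offset between their positions, for example `dot(rope_2d(q, 1, 2), rope_2d(k, 3, 5))` matches `dot(rope_2d(q, 0, 0), rope_2d(k, 2, 3))` up to floating-point error. This relative-position property is what makes rotary embeddings a natural fit for attention over 2D layouts.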
