Representing Online Handwriting for Recognition in Large Vision-Language Models
Anastasiia Fadeeva, Philippe Schlattner, Andrii Maksai, Mark Collier, Efi Kokiopoulou, Jesse Berent, Claudiu Musat
TL;DR
The paper addresses the poor performance of large vision-language models (VLMs) on online handwriting recognition when they are applied naively, i.e., by rendering the ink as an image and performing OCR. It introduces a dual ink representation that combines a time-ordered sequence of text tokens with an image rendering of the ink, enabling effective use with off-the-shelf VLMs without architectural changes. Across two VLM families (PaLI and PaLM-E) and three public datasets, the approach achieves character error rates (CER) comparable to or better than state-of-the-art baselines. Ablations identify multimodal input, relative coordinate tokenization, and time+distance image rendering as the key factors. The method supports both full fine-tuning and parameter-efficient tuning (e.g., LoRA) and generalizes well across datasets and tokenization strategies, offering a practical path to adding handwriting recognition to pre-trained VLMs.
Abstract
The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that represents the ink both as a time-ordered sequence of strokes encoded as text and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. We demonstrate wide applicability through results with two different VLM families on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
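To make the idea of a time-ordered text encoding of strokes more concrete, the following is a minimal sketch of one possible relative-coordinate tokenization. It is not the authors' implementation: the function name `tokenize_ink_relative`, the `<stroke>` marker, the grid size, and the exact token layout are all assumptions chosen for illustration. The sketch only covers the text half of the representation; the image rendering would be supplied to the VLM as a separate input.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]  # points in the order they were written


def tokenize_ink_relative(strokes: List[Stroke], grid: int = 224) -> str:
    """Encode ink as text (hypothetical format, not the paper's exact scheme).

    Points are scaled onto a grid x grid integer lattice, preserving aspect
    ratio. Each stroke starts with an absolute position after a "<stroke>"
    marker; every later point is written as an offset from the previous one.
    """
    points = [p for stroke in strokes for p in stroke]
    if not points:
        return ""

    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    width = max(x for x, _ in points) - min_x
    height = max(y for _, y in points) - min_y
    extent = max(width, height) or 1.0  # avoid division by zero for a single dot
    scale = (grid - 1) / extent

    tokens: List[str] = []
    for stroke in strokes:
        if not stroke:
            continue
        quantized = [
            (round((x - min_x) * scale), round((y - min_y) * scale))
            for x, y in stroke
        ]
        prev_x, prev_y = quantized[0]
        tokens.append(f"<stroke> {prev_x} {prev_y}")
        for x, y in quantized[1:]:
            # Relative offsets keep the numbers small and translation-invariant.
            tokens.append(f"{x - prev_x} {y - prev_y}")
            prev_x, prev_y = x, y
    return " ".join(tokens)


if __name__ == "__main__":
    ink = [[(0, 0), (10, 0), (10, 10)], [(5, 5), (5, 10)]]
    print(tokenize_ink_relative(ink, grid=11))
```

In a setup like this, the resulting string would be passed to the VLM as its text input alongside the rendered image, so no architectural change is needed; how the paper resamples, normalizes, and discretizes the ink in practice may differ from this sketch.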
