Representing Online Handwriting for Recognition in Large Vision-Language Models

Anastasiia Fadeeva, Philippe Schlattner, Andrii Maksai, Mark Collier, Efi Kokiopoulou, Jesse Berent, Claudiu Musat

TL;DR

The paper addresses the poor performance of large vision-language models (VLMs) on online handwriting recognition when the ink is naively rendered as an image and read via OCR. It introduces a dual ink representation that combines a time-ordered sequence of stroke coordinates encoded as text tokens with an image rendering of the ink, so it can be used with off-the-shelf VLMs without architectural changes. Across two VLM families (PaLI and PaLM-E) and three public datasets, the approach achieves character error rates (CER) comparable to or better than state-of-the-art baselines. Ablations identify multimodal input, relative coordinate tokenization, and time+distance image rendering as the key factors. The method supports both full fine-tuning and parameter-efficient tuning (e.g., LoRA) and generalizes well across datasets and tokenization strategies, offering a practical path to adding handwriting recognition to pre-trained VLMs.

Abstract

The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that represents the ink both as a time-ordered sequence of strokes encoded as text and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.

Paper Structure

This paper contains 26 sections, 1 equation, 7 figures, 12 tables.

Figures (7)

  • Figure 1: PaLI and PaLM-E architectures for ink recognition.
  • Figure 2: The full pipeline for the sequence representation in VLMs. The pipeline includes time sampling, scale normalization, discretization on a uniform grid, and representation of each point as two coordinates in text.
  • Figure 3: Examples of different rendering options. Color options: black and white, time (Eq. 1), and time+distance (Eq. 1). Layout options: rendering in one, two, or four lines.
  • Figure 4: Examples from DeepWriting, MathWriting and VNonDB datasets.
  • Figure 5: PaLI recognition on four examples where the prediction from image-only or ink-only input differs from the target. We compare PaLI results to the ground truth from the MathWriting dataset. Mistakes include confusing similar characters such as "tau" and "T", or "d" and "a". We show that those mistakes are addressed by including the ink representation.
  • ...and 2 more figures
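
The sequence pipeline named in the Figure 2 caption (time sampling, scale normalization, discretization on a uniform grid, points written as two coordinates in text) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the sampling step dt, the grid size, the " ; " stroke separator, and the delta-based reading of "relative coordinate tokenization" are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, t): position and timestamp
XY = Tuple[float, float]


def resample_by_time(stroke: List[Point], dt: float) -> List[XY]:
    """Time sampling: linearly interpolate the stroke every dt time units."""
    if len(stroke) < 2:
        return [(x, y) for x, y, _ in stroke]
    t_start, t_end = stroke[0][2], stroke[-1][2]
    steps = int((t_end - t_start) / dt + 1e-9)
    out, i = [], 0
    for k in range(steps + 1):
        t = t_start + k * dt
        # Advance to the segment [i, i + 1] that contains time t.
        while i + 1 < len(stroke) - 1 and stroke[i + 1][2] < t:
            i += 1
        (x0, y0, t0), (x1, y1, t1) = stroke[i], stroke[i + 1]
        a = 0.0 if t1 == t0 else min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        out.append((x0 + a * (x1 - x0), y0 + a * (y1 - y0)))
    return out


def normalize_scale(strokes: List[List[XY]]) -> List[List[XY]]:
    """Scale normalization: fit the whole ink into [0, 1], keeping aspect ratio."""
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    x_min, y_min = min(xs), min(ys)
    extent = max(max(xs) - x_min, max(ys) - y_min) or 1.0  # avoid divide-by-zero
    return [[((x - x_min) / extent, (y - y_min) / extent) for x, y in s]
            for s in strokes]


def discretize(strokes: List[List[XY]], grid: int) -> List[List[Tuple[int, int]]]:
    """Discretization: snap each point to a uniform grid of grid x grid cells."""
    return [[(round(x * (grid - 1)), round(y * (grid - 1))) for x, y in s]
            for s in strokes]


def to_relative(stroke: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Assumed relative encoding: first point absolute, the rest as deltas."""
    rel = stroke[:1]
    for (px, py), (x, y) in zip(stroke, stroke[1:]):
        rel.append((x - px, y - py))
    return rel


def ink_to_text(ink: List[List[Point]], dt: float = 5.0, grid: int = 11,
                relative: bool = False) -> str:
    """Run the full pipeline and write each point as two integers in text."""
    strokes = discretize(normalize_scale([resample_by_time(s, dt) for s in ink]), grid)
    if relative:
        strokes = [to_relative(s) for s in strokes]
    # " ; " is a hypothetical stroke separator chosen for this sketch.
    return " ; ".join(" ".join(f"{x} {y}" for x, y in s) for s in strokes)


ink = [[(0.0, 0.0, 0.0), (10.0, 0.0, 10.0), (10.0, 10.0, 20.0)],
       [(0.0, 10.0, 30.0)]]
print(ink_to_text(ink))
print(ink_to_text(ink, relative=True))
```

In the approach described above, a string of this kind would be passed to the VLM as text input alongside the rendered image of the same ink; a smaller grid or a coarser dt shortens the sequence at the cost of spatial or temporal detail.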