Few-shot Writer Adaptation via Multimodal In-Context Learning

Tom Simon, Stephane Nicolas, Pierrick Tranouez, Clement Chatelain, Thierry Paquet

Abstract

While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.
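The abstract describes inference-time adaptation in which a few labeled line images from the target writer are supplied as context alongside the query image, with no gradient steps. The sketch below is a hypothetical illustration of that input construction (the function and element names are our own, not the authors' API): context (image, transcript) pairs are interleaved into a single sequence that the model conditions on before transcribing the query.

```python
# Hypothetical sketch of few-shot in-context writer adaptation (illustrative
# names only). A few (image, transcript) pairs from the target writer are
# interleaved into one input sequence together with the query line image;
# the model conditions on them at inference time with no parameter updates.

def build_context_sequence(context_pairs, query_image):
    """Interleave context lines and their labels, then append the query.

    context_pairs: list of (image, transcript) tuples from the same writer.
    query_image:   the line image to transcribe.
    """
    sequence = []
    for image, transcript in context_pairs:
        sequence.append(("image", image))
        sequence.append(("label", transcript))
    # The query line has no label at inference; the model must produce it.
    sequence.append(("image", query_image))
    return sequence

# Example: two context lines followed by the query yields 5 elements.
seq = build_context_sequence(
    [("img_a", "the quick"), ("img_b", "brown fox")], "img_q"
)
```

Growing `context_pairs` corresponds to the context-length ablation mentioned in the abstract; the paper reports that more in-context lines from the same writer reduce writer-specific errors.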

Paper Structure

This paper contains 21 sections, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Qualitative examples of inter-writer variation in the IAM dataset, highlighting writer-specific characteristics in letter formation.
  • Figure 2: By integrating context lines $X_c$ and their corresponding labels $Y_c$ from the same author as the query $X$, our context-driven model effectively captures the writer's stylistic characteristics. Leveraging these few in-context examples significantly reduces errors caused by writer-specific variations.
  • Figure 3: Illustration of our context-driven architecture, structured around three core components: (1) a Context-Aware Tokenizer, (2) a CNN encoder, and (3) a Transformer decoder.
  • Figure 4: Confidence-based fusion of context-driven and standard OCR predictions. The $\langle \text{ooc} \rangle$ prediction is represented by the symbol '*'.
  • Figure 5: $\langle \text{out-of-context}\rangle$ token rate vs. number of context lines for IAM and RIMES.
  • ...and 3 more figures