DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

Changhun Kim, Martin Mayr, Thomas Gorges, Fei Wu, Mathias Seuret, Andreas Maier, Vincent Christlein

TL;DR

DRetHTR demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency, and proposes layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers.

Abstract

State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
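The claim that retention avoids a growing KV cache can be illustrated with a small sketch. It uses the standard single-head retention recurrence from the RetNet literature and is not the DRetHTR implementation: normalization, positional rotation, multiple heads, and the image branch are all omitted, and the function names, sizes, and the value gamma = 0.9 are assumptions chosen for illustration. The recurrent form keeps one fixed-size state matrix per step, yet produces the same outputs as the parallel form that looks back over every earlier token.

```python
import random

def retention_parallel(q, k, v, gamma):
    """Parallel form: o_n = sum_{m<=n} gamma^(n-m) * (q_n . k_m) * v_m."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # revisits all earlier tokens, like a KV cache
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * v[m][j]
        out.append(row)
    return out

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed d_k x d_v size at every step
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        state = [[gamma * state[a][b] + k_n[a] * v_n[b] for b in range(d_v)]
                 for a in range(d_k)]
        out.append([sum(q_n[a] * state[a][b] for a in range(d_k))
                    for b in range(d_v)])
    return out

def rand_matrix(rows, cols):
    """Hypothetical helper: random test inputs, not real model activations."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
q, k, v = (rand_matrix(6, 4) for _ in range(3))  # 6 tokens, width 4 (assumed sizes)
out_par = retention_parallel(q, k, v, gamma=0.9)
out_rec = retention_recurrent(q, k, v, gamma=0.9)
max_abs_diff = max(abs(a - b)
                   for row_p, row_r in zip(out_par, out_rec)
                   for a, b in zip(row_p, row_r))
print(f"max |parallel - recurrent| = {max_abs_diff:.2e}")
```

The two forms agree up to floating-point rounding. Because the state never grows, each decoding step costs the same regardless of how many tokens have already been produced.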

Paper Structure (27 sections, 26 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Decoder-only RetNet architecture that fuses image and text in the decoder.
  • Figure 2: Illustration of the DRetHTR training and inference processes: (a) training; (b) inference. ARMF mixes softmax (image) + retention (text) without breaking recurrence.
  • Figure 3: Examples of text length variability in the IAM dataset, from short words to long sentences.
  • Figure 4: Examples of the six augmentations applied to a handwriting image.
  • Figure 5: Local-to-global progression and its retention mimic. (a) In the Transformer baseline (DTrHTR), attention naturally shifts from local to broader context with depth. (b) By assigning smaller $\gamma$ to early decoder layers and larger $\gamma$ to deeper layers, retention reproduces a similar progression without softmax for text–text interactions (cf. Table \ref{tab:gamma_cer_comparison}).
  • ...and 2 more figures
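
The link between $\gamma$ and context width in the Figure 5 caption can be made concrete with a short sketch. A token that is n steps in the past is weighted by $\gamma^n$, so the weights sum to $1/(1-\gamma)$, which serves here as a rough "effective horizon". The schedule below is purely hypothetical: the endpoint values 0.5 and 0.98, the geometric interpolation, the four-layer depth, and the use of one $\gamma$ per layer rather than per head are assumptions for illustration, not the values or rule used in the paper.

```python
def effective_horizon(gamma):
    """Sum of the decay weights 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)

def layerwise_gammas(num_layers, gamma_first=0.5, gamma_last=0.98):
    """Hypothetical schedule: grow the horizon geometrically with depth.

    The endpoints and the interpolation rule are illustrative assumptions.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    h_first = effective_horizon(gamma_first)
    h_last = effective_horizon(gamma_last)
    gammas = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        horizon = h_first * (h_last / h_first) ** t
        gammas.append(1.0 - 1.0 / horizon)  # invert horizon = 1 / (1 - gamma)
    return gammas

gammas = layerwise_gammas(num_layers=4)  # depth of 4 is an assumed example
horizons = [effective_horizon(g) for g in gammas]
for layer, (g, h) in enumerate(zip(gammas, horizons)):
    print(f"layer {layer}: gamma = {g:.4f}, effective horizon = {h:.1f} tokens")
```

Under these assumed endpoints the first layer attends mostly to its last couple of tokens while the last layer spans roughly fifty, which is the local-to-global progression the caption describes.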