DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

Changhun Kim; Martin Mayr; Thomas Gorges; Fei Wu; Mathias Seuret; Andreas Maier; Vincent Christlein

TL;DR

DRetHTR demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency, and proposes layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers.

Abstract

State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
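The constant-memory claim for text-to-text interactions can be illustrated with a minimal sketch of single-head retention as defined in the original RetNet formulation. It shows that the parallel (training-style) form and the recurrent (decoding-style) form give the same outputs while the recurrent form keeps only a fixed-size state. This is an illustration under stated assumptions, not the DRetHTR implementation: it omits multi-head structure, normalization, positional rotation, and the softmax attention over image tokens. The helper `layerwise_gammas` and its linear schedule with bounds 0.8 and 0.99 are hypothetical, chosen only to show how a larger gamma in deeper layers lengthens the decay horizon 1/(1 - gamma).

```python
import random


def retention_parallel(Q, K, V, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma**(n-m) * (Q[n] . K[m]) * V[m]."""
    T, d_v = len(Q), len(V[0])
    out = []
    for n in range(T):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions up to n contribute
            w = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                row[j] += w * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K[n]^T V[n]; out[n] = Q[n] S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    # The state is d_k x d_v regardless of how many tokens have been decoded.
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def layerwise_gammas(num_layers, gamma_min=0.8, gamma_max=0.99):
    """Hypothetical schedule (an assumption): gamma grows linearly with depth."""
    if num_layers == 1:
        return [gamma_max]
    step = (gamma_max - gamma_min) / (num_layers - 1)
    return [gamma_min + i * step for i in range(num_layers)]


def effective_horizon(gamma):
    """Sum of the geometric decay weights gamma**t for t >= 0, i.e. 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    rng = random.Random(0)
    T, d = 5, 4
    Q, K, V = ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(T)] for _ in range(3))
    par = retention_parallel(Q, K, V, 0.9)
    rec = retention_recurrent(Q, K, V, 0.9)
    max_diff = max(abs(a - b) for ra, rb in zip(par, rec) for a, b in zip(ra, rb))
    print("max difference between forms:", max_diff)
    for layer, g in enumerate(layerwise_gammas(4)):
        print(f"layer {layer}: gamma={g:.3f}, horizon={effective_horizon(g):.1f} tokens")
```

In this sketch the recurrent state `S` never grows with the number of decoded tokens, which is the property that removes the growing KV cache for the text stream. The actual gamma values and schedule used by DRetHTR are not given in this section.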

Paper Structure (27 sections, 26 equations, 7 figures, 12 tables)

Introduction
Related Work
Retentive Networks -- Background
Methodology
DRetHTR Architecture
Image Embedding Module
Text Embedding Module
DRetHTR Decoder
Attention-Retention Modality Fusion (ARMF)
Multi-Scale ARMF (MARMF) with Layer-wise Gamma Scaling
Evaluation
IAM Handwriting Database
Synthetic Dataset for Pre-training
Data Preprocessing and Augmentation
Implementation Details
...and 12 more sections

Figures (7)

Figure 1: Decoder-only RetNet architecture that fuses image and text in the decoder.
Figure 2: Illustration of the DRetHTR training and inference processes: (a) training; (b) inference. ARMF mixes softmax (image) + retention (text) without breaking recurrence.
Figure 3: Examples of text length variability in the IAM dataset, from short words to long sentences.
Figure 4: Examples of the six augmentations applied to a handwriting image.
Figure 5: Local-to-global progression and its retention mimic. (a) In the Transformer baseline (DTrHTR), attention naturally shifts from local to broader context with depth. (b) By assigning smaller $\gamma$ to early decoder layers and larger $\gamma$ to deeper layers, retention reproduces a similar progression without softmax for text–text interactions (cf. Table \ref{tab:gamma_cer_comparison}).
...and 2 more figures
