TiCLS: Tightly Coupled Language Text Spotter

Leeje Jang; Yijun Lin; Yao-Yi Chiang; Jerod Weinman

TL;DR

TiCLS is a tightly coupled, end-to-end scene text spotter that uses a character-level pretrained language model (PLM) to inject external linguistic knowledge into vision-based text spotting. The model pairs a DETR-inspired visual encoder and decoder with a linguistic decoder initialized from a PLM trained on short, character-level sequences, enabling robust recognition of degraded or fragmented text. Empirical results on ICDAR 2015, Total-Text, and CTW1500 show state-of-the-art performance, with notable gains in lexicon-free settings and improved handling of long sequences and out-of-vocabulary (OOV) words. The work bridges visual cues with tailored linguistic priors at the cost of increased model size and inference time, and points to future improvements in efficiency and decoding strategies.

Abstract

Scene text spotting aims to detect and recognize text in real-world images, where instances are often short, fragmented, or visually ambiguous. Existing methods primarily rely on visual cues and implicitly capture local character dependencies, but they overlook the benefits of external linguistic knowledge. Prior attempts to integrate language models either adapt language modeling objectives without external knowledge or apply pretrained models that are misaligned with the word-level granularity of scene text. We propose TiCLS, an end-to-end text spotter that explicitly incorporates external linguistic knowledge from a character-level pretrained language model. TiCLS introduces a linguistic decoder that fuses visual and linguistic features, yet can be initialized by a pretrained language model, enabling robust recognition of ambiguous or fragmented text. Experiments on ICDAR 2015 and Total-Text demonstrate that TiCLS achieves state-of-the-art performance, validating the effectiveness of PLM-guided linguistic integration for scene text spotting.
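The idea of a linguistic decoder that "can be initialized by a pretrained language model" can be illustrated with a toy weight-transfer routine. This is a minimal sketch, not the authors' implementation: the parameter names, the flattened list representation, and the assumption that the visual fusion layer has no PLM counterpart are all hypothetical, since this summary does not say which layers are transferred.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]  # parameter name -> flattened weights (toy representation)


def init_from_plm(decoder: Params, plm_decoder: Params) -> Tuple[Params, List[str]]:
    """Copy PLM decoder weights into a linguistic decoder where names and sizes match.

    Returns the initialized parameters and the names that kept their original
    values because no compatible PLM weight was found.
    """
    initialized: Params = {}
    kept: List[str] = []
    for name, value in decoder.items():
        source = plm_decoder.get(name)
        if source is not None and len(source) == len(value):
            initialized[name] = list(source)  # copy so the PLM weights stay untouched
        else:
            initialized[name] = list(value)
            kept.append(name)
    return initialized, kept


# Hypothetical parameter names, for illustration only.
plm_decoder = {"self_attn.weight": [0.1, 0.2], "ffn.weight": [0.3, 0.4]}
linguistic_decoder = {
    "self_attn.weight": [0.0, 0.0],
    "ffn.weight": [0.0, 0.0],
    "visual_cross_attn.weight": [0.5, 0.5],  # assumed to have no PLM counterpart
}
params, kept = init_from_plm(linguistic_decoder, plm_decoder)
```

In this toy setup the language-side layers start from the PLM values, while the layer that attends to visual features keeps its own initialization and is learned during spotter training.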

Paper Structure (26 sections, 5 equations, 3 figures, 5 tables)

Introduction
Related Work
End-to-end Scene Text Spotting
Language-Aware Scene Text Spotting
Text Recognition with Language Modeling
TiCLS for Text Spotting
Visual Feature Extractor and Decoder
Pretrained Language Model for Scene Text
Linguistic Decoder
Optimization and Loss
Experiments and Results
Benchmarks for Scene Text Spotting
Datasets for PLM
Implementation and Training Details
Results and Discussion
...and 11 more sections

Figures (3)

Figure 1: Overall architecture of TiCLS. TiCLS builds on a DETR-style visual encoder (yellow) and visual decoder (orange). A linguistic decoder (blue) receives visual representations from the visual-to-language projection head and performs visual-linguistic fusion for text recognition (green). The linguistic decoder is initialized with our proposed PLM (Section \ref{subsec:plmtrain}).
Figure 2: Proposed PLM architecture. The PLM encoder (yellow block) generates contextualized embeddings from the corrupted text, and the PLM decoder (blue block) autoregressively generates text (green) based on the encoder outputs. TiCLS initializes its linguistic decoder with the PLM decoder (blue block).
Figure 3: Qualitative comparison on Total-Text (a and b) between TiCLS (first row) and DeepSolo (second row). Green (ours) and yellow (DeepSolo) indicate correct detection and recognition results, while red (DeepSolo) denotes incorrect recognition results.
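The Figure 2 caption describes a PLM encoder that reads corrupted text and a decoder that regenerates the text. A rough sketch of how such character-level training pairs could be built is given below. The masking scheme, the mask symbol `_`, and the corruption rate are assumptions made for illustration; the summary does not specify how the text is corrupted.

```python
import random
from typing import List, Tuple

MASK = "_"  # hypothetical mask symbol; the paper's corruption scheme may differ


def corrupt(text: str, rate: float, rng: random.Random) -> Tuple[str, List[int]]:
    """Replace each character with MASK independently with probability `rate`.

    Returns the corrupted string (same length as the input) and the masked positions.
    """
    chars = list(text)
    masked = [i for i in range(len(chars)) if rng.random() < rate]
    for i in masked:
        chars[i] = MASK
    return "".join(chars), masked


rng = random.Random(0)  # fixed seed so the example is reproducible
clean = "STARBUCKS"
corrupted, masked = corrupt(clean, 0.3, rng)
# (corrupted, clean) would form one encoder-input / decoder-target pair.
```

Under this setup, the encoder would receive `corrupted` and the decoder would be trained to emit `clean` one character at a time, which loosely mirrors recognizing a partially occluded word in a scene image.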
