OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition

Lixu Sun, Nurmemet Yolwas, Wushour Silamu

TL;DR

This work tackles the fragility of scene text recognition (STR) under real-world clutter by proposing OTSNet, a neurocognitive-inspired Observation-Thinking-Spelling pipeline. It unifies visual and semantic reasoning through four modules: a Dual Attention Macaron Encoder (DAME) for refined visual features, a Position-Aware Module (PAM) and a Semantic Quantizer (SQ) for position-aware semantic abstraction, and a Multi-Modal Collaborative Verifier (MMCV) for cross-modal self-correction. The model introduces a differential attention mechanism, discrete glyph semantics, and a triadic verification scheme. It achieves state-of-the-art results on Union14M-L (83.5% average accuracy) and OST (79.1%) while maintaining strong performance under geometric distortion and occlusion. Overall, OTSNet demonstrates the practical value of a cognitive-inspired, unified vision-language STR framework with robust open-set and degraded-text capabilities.

Abstract

Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text; together, these effects degrade recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, and establishing new records in 9 out of 14 evaluation scenarios.

Paper Structure

This paper contains 20 sections, 15 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Graphical abstract of OTSNet: A Unified Observation-Thinking-Spelling Network for Scene Text Recognition
  • Figure 2: An overview of OTSNet. OTSNet divides the input image into patches and extracts low-level features, followed by the DAME's deep visual feature extraction to capture fine-grained details. The PAM then fuses positional and visual features via SQ to form glyph semantic features. Finally, MMCV integrates visual, glyph semantic, and character features for joint modeling, producing the final recognition output.
  • Figure 3: Architecture of the observation stage in OTSNet. (a) The Dual Attention Macaron Encoder (DAME), which interleaves standard MHA and proposed DMHA blocks in a Macaron-style structure. (b) Internal design of the Differential Multi-Head Attention (DMHA) block. (c) The Dual-QK Subtractive Attention mechanism, which enhances local discriminability via subtraction of two independent attention maps.
  • Figure 4: Schematic diagram of the Semantic Quantizer (SQ) workflow.
  • Figure 5: Architecture of the MMCV.
  • ...and 5 more figures
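
The caption of Figure 3(c) describes the Dual-QK Subtractive Attention mechanism as the subtraction of two independent attention maps. The following is a minimal, single-head sketch of that idea in plain Python. It is not the authors' implementation: the weighting factor `lam`, the helper names, the random weights, and the absence of any normalization or multi-head structure are all assumptions made for illustration, since this section does not give the exact formulation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(x, w_q, w_k):
    """Row-wise softmax(Q K^T / sqrt(d)) for one query/key projection pair."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return [softmax([s / scale for s in row]) for row in scores]


def subtractive_attention(x, w_q1, w_k1, w_q2, w_k2, w_v, lam=0.5):
    """Apply (A1 - lam * A2) V, where A1 and A2 use independent Q/K projections.

    `lam` is a hypothetical weighting factor, not a value taken from the paper.
    """
    a1 = attention_map(x, w_q1, w_k1)
    a2 = attention_map(x, w_q2, w_k2)
    diff = [[p - lam * q for p, q in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return matmul(diff, matmul(x, w_v)), diff


def random_matrix(rows, cols, rng):
    """Illustrative random weights; a trained model would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
tokens, dim = 4, 8  # 4 patch tokens with 8-dimensional features (arbitrary sizes)
x = random_matrix(tokens, dim, rng)
weights = [random_matrix(dim, dim, rng) for _ in range(5)]
out, diff = subtractive_attention(x, *weights, lam=0.5)
```

Because each softmax row sums to 1, every row of the subtracted map sums to `1 - lam`. Positions that both maps attend to equally are attenuated, while positions favoured only by the first map keep most of their weight, which is one plausible reading of how subtraction could suppress shared background responses.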