OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition

Lixu Sun; Nurmemet Yolwas; Wushour Silamu

TL;DR

This work tackles the fragility of scene text recognition (STR) under real-world clutter by proposing OTSNet, a neurocognitive-inspired Observation-Thinking-Spelling pipeline. It unifies visual and semantic reasoning through four modules: a Dual Attention Macaron Encoder (DAME) for refined visual features, a Position-Aware Module (PAM) and a Semantic Quantizer (SQ) for position-aware semantic abstraction, and a Multi-Modal Collaborative Verifier (MMCV) for cross-modal self-correction. The model introduces a differential attention mechanism, discrete glyph semantics, and a triadic verification scheme. It achieves state-of-the-art results on Union14M-L (83.5% average accuracy) and OST (79.1%) while maintaining strong performance under challenging geometric distortions and occlusions. Overall, OTSNet demonstrates the practical value of a cognitively inspired, unified vision-language STR framework with robust open-set and degraded-text capabilities.

Abstract

Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text; together, these effects degrade recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, and establishing new records in 9 of 14 evaluation scenarios.
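
The abstract does not spell out how DAME forms its differential attention maps. As a rough illustration only, the sketch below assumes one common formulation of the general idea: two softmax attention maps are computed from separate query/key projections, and the second is subtracted from the first with a weight `lam`, so that attention mass shared by both maps (for example, on background clutter) tends to cancel. The function names, the single-head layout, the random weights, and the fixed `lam` are all hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(x, wq, wk):
    """Scaled dot-product attention weights for one query/key projection pair."""
    q, k = matmul(x, wq), matmul(x, wk)
    scale = math.sqrt(len(wq[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    return [softmax([s / scale for s in row]) for row in scores]


def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Assumed formulation: (A1 - lam * A2) @ V, with A1 and A2 softmax maps."""
    a1 = attention_map(x, wq1, wk1)
    a2 = attention_map(x, wq2, wk2)
    diff = [[p - lam * q for p, q in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return matmul(diff, matmul(x, wv)), diff


def random_matrix(rows, cols, rng):
    """Gaussian random matrix, standing in for learned weights or features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: 6 visual tokens, model width 8, head width 4 (all hypothetical).
rng = random.Random(0)
n_tokens, d_model, d_head = 6, 8, 4
x = random_matrix(n_tokens, d_model, rng)
wq1, wk1, wq2, wk2, wv = (random_matrix(d_model, d_head, rng) for _ in range(5))
out, diff = differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5)
```

Because each softmax map has rows that sum to 1, each row of the differential map sums to `1 - lam`, and individual entries can be negative. That is what allows such a map to actively down-weight a region rather than merely ignore it.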
