A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition
Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu
TL;DR
This work tackles the resource-heavy nature of modern scene text recognition (STR) by proposing a training-free, context-driven pipeline that combines an attention-guided segmentation network, block-level localization, and a pre-trained scene captioner (BLIP-2) that generates contextual cues. The approach evaluates three text outputs (T1 from the image, T2 from the scene description, and T3 from cropped text blocks) via multimodal similarity and lexical matching, and selects the most plausible prediction using a final score $C = \alpha S + \beta L$ compared against a threshold $\tau$. When confidence is insufficient, the method falls back to a heavy end-to-end recognizer (DeepSolo), balancing speed and accuracy. Empirical results on ICDAR13-FST, ICDAR15-IST, and TotalText show competitive recognition performance with substantially reduced FLOPs, especially in context-rich scenes, while ablations support the chosen parameter settings and the importance of coherent scene descriptions. The framework offers a practical path to real-time STR deployment on resource-constrained devices; the authors note limitations in multilingual handling and dense-text scenarios, which motivate future multilingual fine-tuning and improved segmentation strategies.
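The score-and-fallback step described in the TL;DR above can be illustrated with a minimal Python sketch. It reads $S$ as a multimodal similarity and $L$ as a lexical match score, both in $[0, 1]$, which is an interpretation of the summary rather than the authors' exact definitions. The names `Candidate`, `lexical_score`, and `select_prediction` are hypothetical, `difflib.SequenceMatcher` is only a stand-in for whatever lexical metric the paper uses, the similarity values are assumed to be precomputed, and the weights and threshold in the usage note are illustrative, not the paper's settings.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple


@dataclass
class Candidate:
    text: str        # candidate string, e.g. T1, T2, or T3
    semantic: float  # S: multimodal similarity in [0, 1], assumed precomputed


def lexical_score(text: str, others: Sequence[str]) -> float:
    """L: best case-insensitive character match against the other candidates.

    SequenceMatcher.ratio() is a stand-in, not the paper's lexical metric.
    """
    if not others:
        return 0.0
    return max(
        SequenceMatcher(None, text.lower(), other.lower()).ratio()
        for other in others
    )


def select_prediction(
    candidates: Sequence[Candidate],
    alpha: float,
    beta: float,
    tau: float,
    fallback: Callable[[], str],
) -> Tuple[str, float, bool]:
    """Return (text, score, used_fallback) using C = alpha * S + beta * L."""
    best_text, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        others = [c.text for j, c in enumerate(candidates) if j != i]
        score = alpha * cand.semantic + beta * lexical_score(cand.text, others)
        if score > best_score:
            best_text, best_score = cand.text, score
    # Accept the lightweight prediction only if it clears the threshold.
    if best_text is not None and best_score >= tau:
        return best_text, best_score, False
    # Otherwise defer to the heavy recognizer (a placeholder callable here).
    return fallback(), best_score, True
```

For example, calling `select_prediction` with candidates `"EXIT"` (S = 0.9), `"exit"` (S = 0.8), and `"EX1T"` (S = 0.4), `alpha = beta = 0.5`, and `tau = 0.7` accepts `"EXIT"` without invoking the fallback; raising `tau` to 0.99 routes the same input to the fallback callable instead.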
Abstract
Modern scene text recognition (STR) systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, deploying heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free, plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computation. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level and improves downstream recognition. Instead of performing traditional text detection, the framework follows a block-level comparison between the feature map and the source image and harnesses contextual information using pretrained captioners, allowing it to generate word predictions directly from scene context. Candidate texts are evaluated semantically and lexically to obtain a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end STR profiling, ensuring faster inference and cutting down on unnecessary computation. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems while requiring substantially fewer resources.
