HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition
Honghui Chen, Yuhang Qiu, Jiabao Wang, Pingping Chen, Nam Ling
TL;DR
The paper addresses scene text recognition under challenging visual conditions by improving internal language modeling through adaptive cross-modal interaction. It introduces Implicit Permutation Neurons (IPN) to generate adaptive attention masks and Cross-modal Hierarchical Attention (CHA) to fuse position, context, and visual information, enabling autoregressive decoding without iterative refinement. Empirical results on synthetic and real STR benchmarks show state-of-the-art accuracy with favorable latency and parameter efficiency, including robustness to occlusion and irregular text shapes. This approach advances real-time, end-to-end STR by leveraging vision–context dependencies in a unified autoregressive framework.
Abstract
Scene Text Recognition (STR) struggles to extract effective character representations from visual data when the text is illegible. Permutation language modeling (PLM) has been introduced to refine character predictions by jointly capturing contextual and visual information. However, the random permutations used in PLM cause the training fit to oscillate, and the iterative refinement (IR) operation adds overhead. To address these issues, this paper proposes the Hierarchical Attention Autoregressive Model with Adaptive Permutation (HAAP), which strengthens the interaction among position, context, and image and thereby improves the generalization of the autoregressive language model. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks that dynamically exploit token dependencies, enhancing the correlation between visual information and context. This adaptive correlation representation helps the model avoid training fit oscillation. Second, we introduce the Cross-modal Hierarchical Attention mechanism (CHA) to capture the dependencies among position queries, contextual semantics, and visual information. CHA enables position tokens to aggregate global semantic information, removing the need for IR. Extensive experiments show that HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
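To make the role of permutations in PLM concrete, the following minimal sketch shows how a decoding order can be turned into an attention mask, which is the kind of randomly sampled mask that the abstract says IPN replaces with adaptive ones. It illustrates generic permutation masking only, not the IPN mechanism itself; the function names `permutation_mask` and `random_permutation` are hypothetical, and the sketch assumes a simplified setting with no start or end tokens.

```python
import random


def permutation_mask(perm):
    """Build an n x n boolean mask from a decoding order.

    perm lists positions in the order they are decoded. mask[i][j] is True
    when position i may attend to position j, i.e. when j is decoded
    strictly before i under this order. (Simplified: no start/end tokens.)
    """
    n = len(perm)
    rank = [0] * n
    for step, pos in enumerate(perm):
        rank[pos] = step  # decoding step at which each position is produced
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


def random_permutation(n, seed=None):
    """Sample a random decoding order, as plain PLM training would."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


# The identity order gives the usual left-to-right causal mask.
left_to_right = permutation_mask([0, 1, 2, 3])
# The reversed order gives a right-to-left mask.
right_to_left = permutation_mask([3, 2, 1, 0])
# A randomly sampled order gives a different mask on each draw.
sampled = permutation_mask(random_permutation(4, seed=0))
```

Because each sampled order yields a different mask, the context visible to a given position changes from one training step to the next; the abstract attributes the training fit oscillation to this randomness.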
