HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition

Honghui Chen, Yuhang Qiu, Jiabao Wang, Pingping Chen, Nam Ling

TL;DR

The paper addresses scene text recognition under challenging visual conditions by improving internal language modeling through adaptive cross-modal interaction. It introduces Implicit Permutation Neurons (IPN) to generate adaptive attention masks and Cross-modal Hierarchical Attention (CHA) to fuse position, context, and visual information, enabling autoregressive decoding without iterative refinement. Empirical results on synthetic and real STR benchmarks show state-of-the-art accuracy with favorable latency and parameter efficiency, including robustness to occlusion and irregular text shapes. This approach advances real-time, end-to-end STR by leveraging vision–context dependencies in a unified autoregressive framework.

Abstract

Scene Text Recognition (STR) struggles to extract effective character representations from visual data when the text is unreadable. Permutation language modeling (PLM) has been introduced to refine character predictions by jointly capturing contextual and visual information. However, the random permutations used in PLM cause the training fit to oscillate, and the iterative refinement (IR) operation adds overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP), which strengthens position-context-image interaction and improves the generalization of the autoregressive LM. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks that dynamically exploit token dependencies, enhancing the correlation between visual information and context. This adaptive correlation representation helps the model avoid training fit oscillation. Second, the Cross-modal Hierarchical Attention mechanism (CHA) is introduced to capture the dependencies among position queries, contextual semantics, and visual information. CHA enables position tokens to aggregate global semantic information, avoiding the need for IR. Extensive experimental results show that the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
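
To make the idea of a permutation-derived attention mask concrete, here is a minimal NumPy sketch of how a fixed decoding order can be turned into a mask, as is commonly done in permutation language modeling. It is not the paper's IPN, which learns adaptive masks; the function name `permutation_attention_mask` and the convention that `True` means "may attend" are assumptions made for illustration.

```python
import numpy as np

def permutation_attention_mask(perm):
    """Build a boolean (T, T) mask from a decoding order.

    perm[s] is the position decoded at step s. Entry [i, j] is True when
    position i may attend to position j, i.e. j is decoded before i.
    (Hypothetical helper; True = "may attend" is an assumed convention.)
    """
    perm = np.asarray(perm)
    num_positions = perm.size
    rank = np.empty(num_positions, dtype=int)
    rank[perm] = np.arange(num_positions)  # rank[p] = step at which p is decoded
    # Compare the decoding step of every key j with that of every query i.
    return rank[None, :] < rank[:, None]

# The left-to-right order gives the usual causal (strictly lower-triangular) mask.
left_to_right = permutation_attention_mask([0, 1, 2, 3])
# The reversed order gives the mirrored, strictly upper-triangular mask.
right_to_left = permutation_attention_mask([3, 2, 1, 0])
# An arbitrary order: position 2 is decoded first, position 1 last.
shuffled = permutation_attention_mask([2, 0, 3, 1])
```

Sampling such orders at random changes the mask from one training step to the next, which is the source of variation that the abstract attributes to random permutations.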

Paper Structure

This paper contains 16 sections, 14 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Illustration of IPN. The solid line represents the process of mask generation, i.e., a non-linear weighted mapping of left-to-right permutations. The dashed line represents the interpretation of the mask generation: the visual information guides the adaptive mask to learn the inter-correlation of the contextual positions.
  • Figure 2: The basic flow of STR. (a) Visual feature coding and decoding. (b) Joint visual and context representation based on external LM. (c) Joint visual and context representation based on internal LM. (d) Internal LM-based visual-context representation without Iterative Refinement (IR) (Ours).
  • Figure 3: The pipeline of HAAP. (a) Image and text inputs are represented as a series of patches and semantic tokens. (b) Visual information guides the IPN to assign bidirectional and adaptive masks to the context. (c) Multi-head attention (MHA) performs hierarchical image-context interaction and decoding.
  • Figure 4: Illustration of a ViT layer from Dosovitskiy et al. (2020). $LN$ denotes layer normalization and $MLP$ denotes a multilayer perceptron.
  • Figure 5: Comparison of qualitative results on challenging textual data including blurring, distortion, occlusion, low resolution, perspective-shifting, and multi-directionality.
  • ...and 7 more figures
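
The ViT layer named in the Figure 4 caption can be sketched as follows, assuming the standard pre-normalization arrangement: layer normalization, then multi-head self-attention with a residual connection, then layer normalization and an MLP with a second residual connection. This is a simplified NumPy illustration rather than the paper's encoder: the learned scale and bias of $LN$ and all linear biases are omitted, the GELU uses the tanh approximation, and the parameter names (`w_qkv`, `w_out`, `w1`, `w2`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension (no learned scale/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    num_tokens, dim = x.shape
    head_dim = dim // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)  # each (num_tokens, dim)

    def split_heads(t):
        # (num_tokens, dim) -> (num_heads, num_tokens, head_dim)
        return t.reshape(num_tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    merged = (attn @ v).transpose(1, 0, 2).reshape(num_tokens, dim)
    return merged @ w_out

def vit_layer(x, params, num_heads):
    # Attention sub-block with residual connection.
    x = x + multi_head_self_attention(
        layer_norm(x), params["w_qkv"], params["w_out"], num_heads
    )
    # MLP sub-block with residual connection.
    x = x + gelu(layer_norm(x) @ params["w1"]) @ params["w2"]
    return x

# Toy usage: 8 patch tokens of width 16, 4 heads, MLP expansion factor 4.
rng = np.random.default_rng(0)
dim, num_heads = 16, 4
params = {
    "w_qkv": rng.normal(scale=0.02, size=(dim, 3 * dim)),
    "w_out": rng.normal(scale=0.02, size=(dim, dim)),
    "w1": rng.normal(scale=0.02, size=(dim, 4 * dim)),
    "w2": rng.normal(scale=0.02, size=(4 * dim, dim)),
}
patches = rng.normal(size=(8, dim))
encoded = vit_layer(patches, params, num_heads)
```

The layer preserves the token count and width, so such layers can be stacked to form an encoder over the image patches.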