CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

Jinzhi Zheng, Ruyi Ji, Libo Zhang, Yanjun Wu, Chen Zhao

TL;DR

This work tackles irregular scene text recognition by introducing CMFN, a cross-modal fusion network that injects visual cues into semantic mining. The CMFN architecture comprises a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch that fuses visual and semantic information over multiple iterations. Empirical results show CMFN achieving state-of-the-art or competitive performance on irregular datasets (IC15, SVTP, CUTE) while maintaining strong results on regular datasets, and ablations demonstrate the benefits of the visual cues and the fusion gate. The approach offers a robust path for recognizing irregular scene text by combining visual cues with semantic reasoning, with potential extensions toward knowledge reasoning.

Abstract

Scene text recognition, as a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored in the process of semantic mining, which limits the performance of these algorithms in recognizing irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding for both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch carries out visual recognition based on the visual features extracted by a CNN and the position encoding information provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way humans recognize scene text and integrates cross-modal visual cues for text recognition. The experiments demonstrate that the proposed CMFN algorithm achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.
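
The abstract names a cross-modal fusion gate but does not give its formula here. As a rough illustration only, the sketch below shows one common form of gated fusion: a sigmoid gate that blends a visual feature vector with a semantic one, element by element. The function name `fusion_gate`, the per-element weights `w_v`, `w_s`, and `bias` are all hypothetical, and the actual gate in CMFN may differ.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fusion_gate(visual, semantic, w_v, w_s, bias):
    """Blend two equal-length feature vectors with an elementwise gate.

    ASSUMPTION: this is a generic gated fusion, not the gate defined in
    the paper. For each dimension d:
        g_d     = sigmoid(w_v[d] * visual[d] + w_s[d] * semantic[d] + bias[d])
        fused_d = g_d * visual[d] + (1 - g_d) * semantic[d]
    """
    fused = []
    for v, s, wv, ws, b in zip(visual, semantic, w_v, w_s, bias):
        g = sigmoid(wv * v + ws * s + b)  # gate value in (0, 1)
        fused.append(g * v + (1.0 - g) * s)
    return fused


if __name__ == "__main__":
    visual = [1.0, 0.0, 0.2]
    semantic = [0.0, 1.0, 0.8]
    zeros = [0.0, 0.0, 0.0]
    # With all-zero weights the gate is 0.5, so the result is a plain average.
    print(fusion_gate(visual, semantic, zeros, zeros, zeros))
```

Because the gate lies strictly between 0 and 1, each fused value stays between the corresponding visual and semantic values, which is the property that makes this kind of gate a soft selector between the two modalities.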

Paper Structure (13 sections, 16 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Comparison of different scene text recognition methods related to our algorithm. (a) Visual recognition methods. (b) Methods that connect the visual module and the language module in series. (c) Methods in which the recognition result of the visual module is corrected by the language module. (d) Our scene text recognition method (CMFN). Our CMFN fuses visual cues in the language module when mining semantic information.
  • Figure 2: The overall architecture of CMFN, which comprises a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch. The dashed arrow indicates the direction in which the attention maps $AT_m$ are transmitted as visual cues. The blue arrows represent the iterative process.
  • Figure 3: The structure of the Multi-Head self-mask attention [9] and the Multi-Head position enhanced self-mask attention. MTDC is short for Matrix multiplication, Transpose, Division, Channel square root. $AT_m$ comes from the visual recognition branch, and $L_{vc}$ is the representation of visual cues in the semantic space of the text.
  • Figure 4: Visualization of $L_{vc}$ for two scene text instances ("PAIN" and "Root"). The first figure in each row represents a scene text example, followed by the visual cues $L_{vc}$ representation corresponding to each character in the text. In the visualization diagram, the horizontal axis represents feature dimensions and the vertical axis represents the corresponding feature values.
  • Figure 5: Text recognition accuracy for different numbers of iterations. TOTAL denotes the result computed over the three corresponding scene text datasets taken as a whole.
  • ...and 1 more figures
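
The MTDC abbreviation in the Figure 3 caption (Matrix multiplication, Transpose, Division, Channel square root) reads like the score computation of scaled dot-product attention, $QK^{\top}/\sqrt{C}$. The sketch below illustrates that reading in plain Python. The function names are hypothetical, and the diagonal mask used for the "self-mask" (each position is prevented from attending to itself) is an assumption, since the caption does not define the mask.

```python
import math


def mtdc(q, k):
    """Scores Q K^T / sqrt(C), where C is the channel (feature) size.

    q and k are lists of rows; every row has C entries.
    """
    scale = math.sqrt(len(q[0]))
    return [
        [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        for q_row in q
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_mask_attention(q, k, v):
    """Single-head attention with an ASSUMED diagonal self-mask."""
    if len(q) != len(k) or len(k) != len(v) or len(q) < 2:
        raise ValueError("q, k, v need the same length, at least 2")
    scores = mtdc(q, k)
    for i in range(len(scores)):
        scores[i][i] = float("-inf")  # position i cannot attend to itself
    weights = [softmax(row) for row in scores]
    dim = len(v[0])
    out = [
        [sum(w * v_row[d] for w, v_row in zip(w_row, v)) for d in range(dim)]
        for w_row in weights
    ]
    return weights, out


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights, out = self_mask_attention(x, x, x)
    for row in weights:
        print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and has a zero on the diagonal, so every output position is built only from the other positions. A multi-head version would run this per head on split channels and concatenate the results.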