LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model

Hongen Liu, Di Sun, Jiahao Wang, Yi Liu, Gang Pan

TL;DR

LOGO addresses the core difficulty of video text spotting by combining language-aware scoring with glyph perception and position-aware tracking. The framework introduces a Language Synergy Classifier (LSC) trained offline on detector outputs to re-score text proposals by legibility, a Glyph Supervision branch with MSE-based learning to better recognize noisy glyphs, and a Visual Position Mixture Module (VPMM) that fuses spatial and visual cues for discriminative tracking features. When integrated with a detector like PP-YOLOE-R and a recognizer like ABINet, LOGO achieves improved MOTA/IDF1 across benchmarks such as ICDAR2015 Video and DSText, and shows the value of language-guided re-scoring in reducing false positives while preserving hard-to-detect texts. The results demonstrate meaningful gains without requiring costly end-to-end transformer training, highlighting practical impact for robust video text spotting in diverse scenes.

Abstract

Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, recent methods directly track the zero-shot results of state-of-the-art image text spotters and achieve impressive performance. However, owing to the domain gap between datasets, these methods usually obtain limited tracking trajectories on extreme datasets. Fine-tuning transformer-based text spotters on specific datasets could improve performance, albeit at the expense of considerable training resources. In this paper, we propose a Language Collaboration and Glyph Perception Model, termed LOGO, a framework designed to enhance the performance of conventional text spotters. To this end, we design a language synergy classifier (LSC) that explicitly distinguishes text instances from background noise in the recognition stage. Specifically, the LSC outputs either the text content or a background code depending on the legibility of each text region, and language scores are computed from this output. Fusion scores are then computed as the average of the detection scores and the language scores, and are used to re-score the detection results before tracking. Through this re-scoring mechanism, the LSC facilitates the detection of low-resolution text instances while filtering out text-like regions. Moreover, glyph supervision is introduced to enhance the recognition accuracy of noisy text regions. In addition, we propose the visual position mixture module, which efficiently merges position information and visual features to obtain more discriminative tracking features. Extensive experiments on public benchmarks validate the effectiveness of the proposed method.
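
The re-scoring step described in the abstract can be illustrated with a minimal sketch. It assumes per-detection scores in [0, 1] and averages them as the abstract states; the function names and the `keep_threshold` value are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def fuse_scores(det_scores: List[float], lang_scores: List[float]) -> List[float]:
    """Average detection and language scores, one pair per detection."""
    if len(det_scores) != len(lang_scores):
        raise ValueError("det_scores and lang_scores must have the same length")
    return [(d + l) / 2.0 for d, l in zip(det_scores, lang_scores)]


def rescore_and_filter(
    det_scores: List[float],
    lang_scores: List[float],
    keep_threshold: float = 0.6,  # hypothetical value, not from the paper
) -> List[Tuple[int, float]]:
    """Return (detection index, fusion score) for detections kept after re-scoring."""
    fused = fuse_scores(det_scores, lang_scores)
    return [(i, s) for i, s in enumerate(fused) if s >= keep_threshold]


if __name__ == "__main__":
    # Detection 0: text-like background (confident detector, illegible content).
    # Detection 1: low-resolution but legible text (weak detector score).
    # Detection 2: clear text.
    det = [0.9, 0.4, 0.8]
    lang = [0.1, 0.9, 0.8]
    for index, score in rescore_and_filter(det, lang):
        print(index, round(score, 2))
```

In this toy input, the text-like region falls below the threshold after fusion even though its detection score is the highest, while the low-resolution instance is retained because its language score is high.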

Paper Structure (19 sections, 15 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Rethinking the advantages and limitations of GoMatching. (a) The video text spotting performance of ByteTrack and TransDETR on ICDAR2015 Video. The single-frame spotting results used by ByteTrack are obtained with DeepSolo, and MOTA and IDF1 serve as the evaluation metrics. (b) The number of mostly tracked trajectories from GoMatching and TencentOCR for video text tracking and video text spotting on DSText.
  • Figure 2: Visualization of the detection results from PP-YOLOE-R and the re-scored results from Language Synergy Classifier. (a) PP-YOLOE-R. (b) Language Synergy Classifier (LSC). Confidence scores range from 0 to 1, with colors closer to red indicating higher confidence scores.
  • Figure 3: The overall pipeline of LOGO. In this pipeline, a YOLO-based rotated detector is employed to improve training efficiency, and the Language Synergy Classifier (LSC) is designed to distinguish text instances from background noise based on language knowledge. In addition, we introduce the glyph branch to enhance the model's perception of text structure, and propose the Visual Position Mixture Module (VPMM) to fuse the position information and visual features of the detection results. The fused features are fed into LST-Matcher for video text tracking.
  • Figure 4: The architecture of language synergy classifier. In the training phase, ground truths (GT) and prediction boxes with low Intersection over Union (IoU) with GT are defined as positive and negative samples, respectively. Next, we extract the text regions of these samples by RoI Rotate operation, and encode their recognition results. Then, these text regions and recognition encodings are fed into LSC for network optimization. In the inference phase, the language synergy classifier outputs language scores based on the recognition results. After that, the fusion scores are computed by taking the average of language scores and detection scores, and are utilized to recalibrate the detection results. Notably, "GT" represents the ground truths from training datasets, while $\langle /s \rangle$ and $\langle /p \rangle$ represent the end symbol and padding symbol, respectively.
  • Figure 5: The structure of glyph supervision. $F_{0}$, $F_{1}$, $F_{2}$ represent the features extracted from the backbone, while $G_{0}$ and $G_{1}$ denote the features from the glyph branch. Additionally, $S_{m}$ and $S_{pl}$ signify the segmentation masks and pseudo-labels, respectively.
  • ...and 3 more figures
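
The sample assignment described in the Figure 4 caption (ground truths as positives, predictions with low IoU against every ground truth as negatives) can be sketched as follows. This is a simplified illustration only: it uses axis-aligned boxes instead of the rotated boxes in the paper, the `neg_iou_max` threshold is a hypothetical value, and discarding predictions with intermediate IoU is an assumption rather than something the caption states.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_lsc_samples(
    gt_boxes: List[Box],
    pred_boxes: List[Box],
    neg_iou_max: float = 0.1,  # hypothetical threshold, not from the paper
) -> List[Tuple[Box, int]]:
    """Label ground truths as 1 and low-IoU predictions as 0."""
    samples = [(box, 1) for box in gt_boxes]
    for pred in pred_boxes:
        best = max((iou(pred, gt) for gt in gt_boxes), default=0.0)
        if best < neg_iou_max:
            samples.append((pred, 0))  # background-like region
        # Predictions with intermediate IoU are skipped (assumption).
    return samples
```

With one ground truth at `(0, 0, 10, 10)`, a prediction at `(1, 1, 11, 11)` overlaps it heavily and is skipped, while a prediction at `(50, 50, 60, 60)` has no overlap and becomes a negative sample.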