GloTSFormer: Global Video Text Spotting Transformer

Han Wang, Yanjie Wang, Yang Li, Can Huang

TL;DR

GloTSFormer reframes Video Text Spotting as a global association problem across frames, using a global embedding pool and a Transformer-based memory mechanism to model long-range temporal context. It introduces a morphology-aware Wasserstein distance for cross-frame matching: text polygons are modeled as Gaussian distributions, which keeps associations robust under rapid motion and camera changes. The system jointly optimizes detection, recognition, and tracking with a multi-task loss and builds trajectories frame by frame with a Hungarian solver. It achieves state-of-the-art results on ICDAR2015 video and strong performance on other benchmarks while maintaining practical speed. Overall, the method shows that explicit global temporal modeling and a morphology-aware distance metric improve the reliability of video text spotting and tracking, with potential impact on video understanding tasks.

Abstract

Video Text Spotting (VTS) is a fundamental visual task that aims to predict the trajectories and content of texts in a video. Previous works usually conduct local associations and apply IoU-based distance and complex post-processing procedures to boost performance, ignoring the abundant temporal information and the morphological characteristics in VTS. In this paper, we propose a novel Global Video Text Spotting Transformer, GloTSFormer, which models the tracking problem as global associations and uses the Gaussian Wasserstein distance to guide the morphological correlation between frames. Our main contributions are threefold. (1) We propose a Transformer-based global tracking method, GloTSFormer, for VTS that associates multiple frames simultaneously. (2) We introduce a Wasserstein distance-based method to conduct positional associations between frames. (3) We conduct extensive experiments on public datasets. On the ICDAR2015 video dataset, GloTSFormer achieves 56.0 MOTA, a 4.6 absolute improvement over the previous SOTA method, and outperforms the previous Transformer-based method by a significant 8.3 MOTA.
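
To make the positional association concrete, the following is a minimal sketch of a Gaussian Wasserstein distance between two text regions. It is an illustration under stated assumptions, not the authors' implementation: each region is assumed to be a rotated box (cx, cy, w, h, theta) mapped to a 2D Gaussian with mean (cx, cy) and covariance R diag(w²/4, h²/4) Rᵀ, and the names box_to_gaussian and gaussian_wasserstein are hypothetical. The closed form used for the 2x2 case is the standard 2-Wasserstein distance between Gaussians.

```python
import math

def box_to_gaussian(cx, cy, w, h, theta):
    """Map a rotated box to (mean, covariance). Assumed parametrization:
    covariance = R * diag((w/2)^2, (h/2)^2) * R^T, stored as (sxx, sxy, syy)."""
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2.0) ** 2, (h / 2.0) ** 2
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)

def gaussian_wasserstein(box1, box2):
    """2-Wasserstein distance between the Gaussians of two boxes.

    W^2 = |m1 - m2|^2 + tr(S1) + tr(S2) - 2 * tr((S1^1/2 S2 S1^1/2)^1/2).
    For 2x2 covariances the last trace equals
    sqrt(tr(S1 S2) + 2 * sqrt(det(S1) * det(S2))).
    """
    (mx1, my1), (a1, b1, d1) = box_to_gaussian(*box1)
    (mx2, my2), (a2, b2, d2) = box_to_gaussian(*box2)
    center_term = (mx1 - mx2) ** 2 + (my1 - my2) ** 2
    trace_sum = a1 + d1 + a2 + d2
    trace_prod = a1 * a2 + 2.0 * b1 * b2 + d1 * d2            # tr(S1 S2)
    det_prod = (a1 * d1 - b1 * b1) * (a2 * d2 - b2 * b2)      # det(S1) * det(S2)
    cross = math.sqrt(max(trace_prod + 2.0 * math.sqrt(max(det_prod, 0.0)), 0.0))
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(center_term + trace_sum - 2.0 * cross, 0.0))

# Two boxes of the same shape that no longer overlap (IoU would be 0),
# and two boxes with the same center but different orientation.
moved = gaussian_wasserstein((0, 0, 10, 4, 0.0), (30, 0, 10, 4, 0.0))
rotated = gaussian_wasserstein((0, 0, 10, 4, 0.0), (0, 0, 10, 4, math.pi / 2))
print(moved, rotated)
```

Unlike an IoU-based distance, which saturates once two boxes stop overlapping, this distance keeps growing with the center offset and also penalizes differences in shape and orientation, which is the property the abstract refers to as morphological correlation.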

Paper Structure

This paper contains 19 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Motivation. Previous works usually conduct local associations and easily fail in scenes with interference (e.g., identical texts). To address this, we introduce global associations that exploit temporal information, making our method more robust in such scenes.
  • Figure 2: Overview. A global embedding pool is maintained to store historical tracking embeddings and trajectory information and is updated after each frame. With a shallow Transformer layer, we conduct associations between embeddings of the current frame and embeddings in the global embedding pool to obtain the global association score. Furthermore, a Wasserstein distance-based method is applied to measure the positional similarity between texts in frames. Some architectural details are omitted for clarity.
  • Figure 3: Three cases that demonstrate the effectiveness of the Wasserstein distance. Both the IoU-based distance and the Wasserstein distance succeed in Case 1. In Case 2 and Case 3, however, fast movements degrade the IoU-based distance, whereas the Wasserstein distance produces steadier results by considering both location and morphology.
  • Figure 4: Results of the previous Transformer-based methods TransVTSpotter and TransDETR and of our GloTSFormer. Different IDs are shown in different colors. Some of the false results (e.g., FNs, ID switches, and IDFs) are marked with a dotted red circle and pointed out by a red arrow. GloTSFormer performs better than the previous Transformer-based methods, especially in crowded scenes.
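
The association step described in the Figure 2 caption can be illustrated with a small sketch. Everything below is an assumption made for illustration: the score matrix stands in for the global association scores between current-frame detections and trajectories in the embedding pool, the threshold min_score and the function name assign_tracks are hypothetical, and an exhaustive search replaces the Hungarian solver (it finds the same optimal one-to-one assignment, but is only practical for a handful of detections).

```python
from itertools import permutations

def assign_tracks(score, min_score=0.5):
    """Optimal one-to-one assignment of detections to trajectories.

    score[i][j] is the association score between detection i and
    trajectory j (higher is better). Returns a list whose i-th entry is
    the matched trajectory index, or None when detection i should start
    a new trajectory. Leaving a detection unmatched is worth min_score,
    so a match scoring below that value is never preferred.
    """
    n_det = len(score)
    n_trk = len(score[0]) if n_det else 0
    # One "no match" slot per detection so any detection may stay unmatched.
    slots = list(range(n_trk)) + [None] * n_det
    best_total, best = float("-inf"), []
    for perm in permutations(slots, n_det):
        total = sum(min_score if j is None else score[i][j]
                    for i, j in enumerate(perm))
        if total > best_total:
            best_total, best = total, list(perm)
    return best

# Rows: detections in the current frame; columns: existing trajectories.
scores = [
    [0.90, 0.80],   # detection 0 resembles both trajectories
    [0.85, 0.10],   # detection 1 only fits trajectory 0
    [0.20, 0.30],   # detection 2 fits neither well
]
print(assign_tracks(scores))
```

In this example a greedy per-detection choice would give trajectory 0 to detection 0 and leave detection 1 without a good match, while the joint assignment gives trajectory 1 to detection 0 and trajectory 0 to detection 1, and detection 2 starts a new trajectory. In practice a dedicated solver such as a Hungarian implementation would replace the exhaustive search.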