GloTSFormer: Global Video Text Spotting Transformer
Han Wang, Yanjie Wang, Yang Li, Can Huang
TL;DR
GloTSFormer reframes Video Text Spotting as a global association problem across frames, using a global embedding pool and a Transformer-based memory mechanism to model long-range temporal context. It introduces a morphology-aware Wasserstein distance for cross-frame matching by modeling text polygons as Gaussian distributions, which enables robust associations under rapid motion and camera changes. The system jointly optimizes detection, recognition, and tracking with a multi-task loss and builds trajectories frame by frame using a Hungarian solver, achieving state-of-the-art results on the ICDAR2015 video dataset and strong performance on other benchmarks while maintaining practical speed. Overall, the method demonstrates that explicit global temporal modeling and a morphology-aware distance metric substantially improve reliability in video text spotting and tracking, with potential impact on video understanding tasks.
Abstract
Video Text Spotting (VTS) is a fundamental visual task that aims to predict the trajectories and content of texts in a video. Previous works usually conduct local associations and rely on IoU-based distances and complex post-processing procedures to boost performance, ignoring the abundant temporal information and the morphological characteristics in VTS. In this paper, we propose a novel Global Video Text Spotting Transformer, GloTSFormer, which models the tracking problem as global associations and uses the Gaussian Wasserstein distance to guide the morphological correlation between frames. Our main contributions are threefold: (1) we propose a Transformer-based global tracking method, GloTSFormer, for VTS that associates multiple frames simultaneously; (2) we introduce a Wasserstein distance-based method to conduct positional associations between frames; (3) we conduct extensive experiments on public datasets. On the ICDAR2015 video dataset, GloTSFormer achieves 56.0 MOTA, a 4.6-point absolute improvement over the previous SOTA method, and outperforms the previous Transformer-based method by a significant 8.3 MOTA.
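To make the positional association concrete, the following is a minimal sketch, not the authors' implementation. It assumes each text polygon is summarized by the mean and covariance of its vertices (one simple way to obtain a Gaussian; the paper's exact parameterization is not given in this section), computes the squared 2-Wasserstein distance between the resulting Gaussians, and matches two frames with SciPy's Hungarian-style solver. The names `polygon_to_gaussian`, `gaussian_wasserstein_sq`, and `match_frames` are hypothetical, and the sketch uses position only, without the embedding similarity that the full method also relies on.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def polygon_to_gaussian(poly, eps=1e-6):
    """Summarize a polygon by the mean and covariance of its vertices.

    Assumption for illustration only: the paper may derive the Gaussian
    differently. eps keeps the covariance strictly positive definite.
    """
    pts = np.asarray(poly, dtype=float)
    mean = pts.mean(axis=0)
    centered = pts - mean
    cov = centered.T @ centered / len(pts) + eps * np.eye(2)
    return mean, cov


def sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_wasserstein_sq(g1, g2):
    """Squared 2-Wasserstein distance between two Gaussians (mean, cov)."""
    (m1, s1), (m2, s2) = g1, g2
    s1_half = sqrtm_psd(s1)
    cross = sqrtm_psd(s1_half @ s2 @ s1_half)
    return float(np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross))


def match_frames(prev_polys, curr_polys):
    """Match polygons across two frames by minimizing total distance."""
    prev = [polygon_to_gaussian(p) for p in prev_polys]
    curr = [polygon_to_gaussian(p) for p in curr_polys]
    cost = np.array([[gaussian_wasserstein_sq(a, b) for b in curr] for a in prev])
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist())), cost


if __name__ == "__main__":
    box = [(0, 0), (4, 0), (4, 2), (0, 2)]
    moved = [(x + 3, y + 4) for x, y in box]
    print(gaussian_wasserstein_sq(polygon_to_gaussian(box), polygon_to_gaussian(moved)))
```

Unlike IoU, which drops to zero as soon as two boxes stop overlapping, this distance keeps growing smoothly with displacement and shape change, so fast-moving text can still be ranked against candidate matches. In this sketch a pure translation contributes only the squared center offset, since the two covariances are identical.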
