VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Yuliang Liu; Mingxin Huang; Hao Yan; Linger Deng; Weijia Wu; Hao Lu; Chunhua Shen; Lianwen Jin; Xiang Bai

TL;DR

VimTS tackles cross-domain text spotting by unifying word-level, line-level, and video-level tasks in a single framework. The approach introduces a Prompt Queries Generation Module and a Task-aware Adapter to enable multi-task learning with minimal parameter overhead, and uses VTD-368k, a synthetic video text dataset generated with CoDeF, to inject temporal dynamics at low cost. Empirical results show gains on both image-to-image and image-to-video benchmarks: an average improvement of about 2.6% in H-mean on six image-level cross-domain benchmarks, and about 5.5% in MOTA on video benchmarks even when the model is trained only on image-level data. Additional gains are achieved with video data or VTD-368k. The work also highlights the limits of large multimodal models for cross-domain scene text spotting and demonstrates that task-specific, low-resource approaches can generalize well.

Abstract

Text spotting, a task involving the extraction of textual information from images or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Task-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Task-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) built with the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generalizing to cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and less data. The code and datasets will be made available at https://VimTextSpotter.github.io.
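The two metrics quoted above can be made concrete with a minimal sketch. It assumes the standard definitions (H-mean as the harmonic mean of precision and recall, MOTA as one minus the ratio of misses, false positives, and identity switches to ground-truth objects); the function names and the numbers in the usage lines are hypothetical and are not taken from the paper or its evaluation code.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (also called F-measure)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy, with counts summed over all frames.

    Assumes the standard CLEAR-MOT definition; num_gt is the total number
    of ground-truth objects across all frames.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Hypothetical counts, for illustration only.
print(h_mean(0.8, 0.6))
print(mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100))
```

Because MOTA subtracts all three error types from one, it can be negative when a tracker makes more errors than there are ground-truth objects, which is worth keeping in mind when reading averaged MOTA differences.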

Paper Structure (28 sections, 9 equations, 12 figures, 13 tables)

This paper contains 28 sections, 9 equations, 12 figures, 13 tables.

Introduction
Related Work
Scene Text Spotting
Video Text Spotting
Domain Adaptation
Methodology
Feature Extraction
Query Initialization
Decoder
Prompt Queries Generation Module
Task-aware Adapter
Tracking Queries
Optimization
Synthetic Data
Data Source and Preprocessing
...and 13 more sections

Figures (12)

Figure 1: (a) and (b) show the two types of cross-domain text spotting: image-to-image and image-to-video. TT denotes TotalText [ch2019total], IC15 denotes ICDAR2015 [karatzas2015icdar], and IC13 denotes ICDAR2013 video [karatzas2013icdar].
Figure 2: Applying static text spotting methods (TESTR, results shown in the image) to videos, even those with minimal motion, leads to poor performance in both bounding box recall and recognition accuracy.
Figure 3: Overall framework of VimTS. Image features are first extracted in the feature extraction stage. Query Initialization then generates the task-aware queries, which include detection and recognition queries. The task-aware queries are sent to the task-aware decoder to obtain the detection and recognition results simultaneously. The red arrow denotes word-level text spotting, and the blue arrow denotes line-level text spotting. After the model is pre-trained, we freeze most of its parameters and train only the Task-aware Adapter and the Prompt Queries Generation Module to convert the original single-task model into a multi-task model.
Figure 4: Illustration of the Prompt Queries Generation Module. Prompt queries for different tasks exchange information in the prompt queries generation module. $h$ is the number of parallel attention heads.
Figure 5: The overall structure of the task-aware adapter. The Adapter-1 is used to aggregate detection information and learn temporal information. The Adapter-2 is used to aggregate recognition information and learn temporal information.
...and 7 more figures
