VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai

TL;DR

VimTS tackles cross-domain text spotting by unifying word-level, line-level, and video-level tasks in a single framework. The approach introduces a Prompt Queries Generation Module and a Task-aware Adapter to enable multi-task learning with minimal parameter overhead, and uses VTD-368k, a synthetic video dataset generated with CoDeF, to inject temporal dynamics. Empirical results show gains on both image-to-image and image-to-video benchmarks: an average improvement of about 2.6% in H-mean across six image-level cross-domain benchmarks, and about 5.5% in MOTA on video benchmarks (ICDAR2015 video and DSText v2) even when trained only on image-level data. Additional gains are achieved with video data or VTD-368k. The work also highlights the limits of large multimodal models for cross-domain scene text spotting and shows that a task-specific approach with far fewer parameters and less data can generalize well.

Abstract

Text spotting, a task involving the extraction of textual information from images or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Task-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Task-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generalizing to cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and less data. The code and datasets will be made available at https://VimTextSpotter.github.io.
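The two metrics quoted above can be made concrete with a minimal sketch. It assumes the standard definitions: H-mean as the harmonic mean of precision and recall, and MOTA as one minus the ratio of accumulated errors (misses, false positives, identity switches) to the number of ground-truth objects. The function names and the example counts are hypothetical, and the matching protocol that produces the counts (for example, the IoU threshold and transcription matching) is not specified here.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (also called F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames.

    The value can be negative when the errors outnumber the ground truth.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not results from the paper).
    print(f"H-mean: {h_mean(0.8, 0.6):.4f}")
    print(f"MOTA:   {mota(10, 5, 5, 100):.4f}")
```

Because MOTA penalizes identity switches as well as detection errors, a spotter that detects text well in each frame can still score poorly if it fails to keep track identities consistent across frames.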

Paper Structure (28 sections, 9 equations, 12 figures, 13 tables)

This paper contains 28 sections, 9 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: Panels (a) and (b) show the two types of cross-domain text spotting: image-to-image and image-to-video. TT denotes TotalText [ch2019total], IC15 denotes ICDAR2015 [karatzas2015icdar], and IC13 denotes ICDAR2013 video [karatzas2013icdar].
  • Figure 2: Applying static text spotting methods (TESTR, results shown in the image) to videos, even those with minimal motion, leads to poor performance in both bounding box recall and recognition accuracy.
  • Figure 3: Overall framework of VimTS. Image features are extracted in the feature extraction process. Query Initialization then generates the task-aware queries, which include detection and recognition queries. The task-aware queries are sent to the task-aware decoder to obtain the detection and recognition results simultaneously. Red arrows denote word-level text spotting; blue arrows denote line-level text spotting. After the model is pre-trained, we freeze most of its parameters and train only the Task-aware Adapter and the Prompt Queries Generation Module to convert the original single-task model into a multi-task model.
  • Figure 4: Illustration of the Prompt Queries Generation Module. Prompt queries for different tasks exchange information in the Prompt Queries Generation Module. $h$ is the number of parallel attention heads.
  • Figure 5: The overall structure of the Task-aware Adapter. Adapter-1 aggregates detection information and learns temporal information. Adapter-2 aggregates recognition information and learns temporal information.
  • ...and 7 more figures
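
The captions for Figures 3 and 5 describe freezing most of the pre-trained model and training only small added modules. The sketch below shows the general idea with a generic bottleneck adapter: a down-projection, a non-linearity, an up-projection, and a residual connection. It is not the paper's Task-aware Adapter, whose internal design is not given here. The class name `BottleneckAdapter`, the dimensions, and the zero initialization of the up-projection are all assumptions made for illustration.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class BottleneckAdapter:
    """Generic residual bottleneck adapter (hypothetical, for illustration)."""

    def __init__(self, dim: int, bottleneck: int, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        self.bottleneck = bottleneck
        # Down-projection: dim -> bottleneck, small random initialization.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: bottleneck -> dim, zero-initialized (an assumption)
        # so the adapter starts as the identity and leaves the frozen
        # model's features unchanged before training.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, features):
        hidden = [max(0.0, h) for h in matvec(self.down, features)]  # ReLU
        delta = matvec(self.up, hidden)
        return [f + d for f, d in zip(features, delta)]  # residual connection

    def num_params(self) -> int:
        """Trainable weights in the two projections (biases omitted)."""
        return 2 * self.dim * self.bottleneck


if __name__ == "__main__":
    dim, bottleneck = 256, 16  # hypothetical sizes
    adapter = BottleneckAdapter(dim, bottleneck)
    frozen_layer_params = dim * dim  # one frozen dim x dim projection
    print("adapter params:", adapter.num_params())
    print("frozen layer params:", frozen_layer_params)
```

With these hypothetical sizes, the adapter adds 8,192 trainable weights next to a frozen 65,536-weight projection, which illustrates how adapter-style tuning keeps the number of additional parameters small.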