DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting

Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao

TL;DR

DeepSolo++ presents a DETR-like, single-decoder framework that unifies text detection, recognition, and script identification through explicit point queries derived from Bezier center curves. By sampling $N$ on-curve points for $K$ top proposals and feeding composite queries into a lightweight predictor, the approach achieves end-to-end multilingual spotting with simple parallel heads and script-aware routing. The method introduces script-aware bipartite matching and leverages CTC-based transcript losses, demonstrating strong results on monolingual benchmarks and multilingual datasets (MLT17/19), while remaining compatible with line annotations. Overall, DeepSolo++ delivers a simple, efficient, and extensible solution that achieves state-of-the-art performance across dense and long text scenarios and diverse scripts, with promising transferability to weakly annotated data.
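The point-sampling step mentioned above can be illustrated with a small sketch. It assumes a cubic Bezier center curve (four control points) and uniform spacing in the curve parameter $t$; neither detail is stated in this summary, and the function name `sample_cubic_bezier` is hypothetical rather than taken from the released code.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def sample_cubic_bezier(ctrl: List[Point], n: int) -> List[Point]:
    """Return n points on a cubic Bezier curve, uniformly spaced in t.

    Assumption: the center curve is cubic (4 control points) and the
    points are spaced uniformly in the parameter t, not in arc length.
    """
    if len(ctrl) != 4:
        raise ValueError("a cubic Bezier curve needs exactly 4 control points")
    if n < 2:
        raise ValueError("n must be at least 2")

    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = ctrl
    points: List[Point] = []
    for i in range(n):
        t = i / (n - 1)
        # Cubic Bernstein basis polynomials evaluated at t.
        b0 = (1 - t) ** 3
        b1 = 3 * t * (1 - t) ** 2
        b2 = 3 * t ** 2 * (1 - t)
        b3 = t ** 3
        x = b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3
        y = b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3
        points.append((x, y))
    return points


# Hypothetical center curve of one text proposal, in pixel coordinates.
center_curve = [(10.0, 50.0), (40.0, 30.0), (80.0, 30.0), (110.0, 50.0)]
on_curve_points = sample_cubic_bezier(center_curve, n=25)
```

In a pipeline of this kind, each sampled coordinate would then seed the positional part of one point query; how that embedding is built is not covered by this sketch.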

Abstract

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate heuristic post-processing, they still suffer from poor synergy between the sub-tasks and low training efficiency. Moreover, they overlook multilingual text spotting, which requires an additional script identification task. In this paper, we present DeepSolo++, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries encode the requisite text semantics and locations, and can thus be further decoded into the center line, boundary, script, and confidence of the text via very simple prediction heads in parallel. Furthermore, we show the surprisingly good extensibility of our method in terms of character class, language type, and task. On the one hand, our method not only performs well in English scenes but also masters transcription of scripts with complex font structures and thousands of character classes, such as Chinese. On the other hand, DeepSolo++ achieves better performance on the additionally introduced script identification task with a simpler training pipeline than previous methods. In addition, our models are compatible with line annotations, which cost much less to annotate than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.
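Because each text instance is modeled as an ordered sequence of point queries, and the TL;DR notes that transcripts are trained with CTC-based losses, one plausible way to read out a transcript is standard greedy CTC decoding over the per-point character scores. The sketch below shows that generic procedure only; the function name, the toy charset, and the convention that index 0 is the blank are assumptions, and it is not claimed to match the authors' inference code.

```python
from typing import List, Sequence


def ctc_greedy_decode(scores: Sequence[Sequence[float]],
                      charset: List[str],
                      blank: int = 0) -> str:
    """Greedy CTC decoding: argmax per point, merge repeats, drop blanks.

    scores[i][c] is the score of class c at the i-th ordered point.
    charset[c] is the character for class c; charset[blank] is unused.
    Assumption: class index 0 is the CTC blank.
    """
    decoded = []
    previous = blank
    for point_scores in scores:
        # Pick the highest-scoring class for this point.
        best = max(range(len(point_scores)), key=lambda c: point_scores[c])
        # Emit only when the class changes and is not the blank.
        if best != blank and best != previous:
            decoded.append(charset[best])
        previous = best
    return "".join(decoded)


# Toy example: 7 points, 3 classes (blank, "a", "b").
toy_charset = ["<blank>", "a", "b"]
toy_scores = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.8, 0.1],  # a (repeat, merged)
    [0.9, 0.0, 0.1],  # blank
    [0.2, 0.7, 0.1],  # a (new, separated by blank)
    [0.1, 0.1, 0.8],  # b
    [0.1, 0.1, 0.8],  # b (repeat, merged)
    [0.9, 0.1, 0.0],  # blank
]
transcript = ctc_greedy_decode(toy_scores, toy_charset)  # "aab"
```

The blank between the second and third "a" predictions is what allows a doubled character to survive the merge step.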

Paper Structure (34 sections, 13 equations, 13 figures, 21 tables)


Figures (13)

  • Figure 1: Comparison of text spotting pipelines and query designs. In our method, for both monolingual and multilingual text spotting, the spotting part is a solo by the Transformer decoder with explicit points. 'TrEnc.' ('TrDec.'): Transformer encoder (decoder). 'Char.': characters. 'Seg.': segmentation. 'Lan. Predictor': language prediction network.
  • Figure 2: The architecture of DeepSolo++. We propose an explicit query form based on the points sampled from the Bezier center curve representation of text, solving multilingual text spotting with a single decoder and simple prediction heads in a concise framework.
  • Figure 3: The illustration of query modeling (top) in DeepSolo++ and the pipeline of training and inference (bottom). For ease of illustration, only the queries for one text instance are plotted, and only the linear layers for script identification and character classification are shown in the bottom section.
  • Figure 4: The illustration of script-aware matching.
  • Figure 5: Comparison with open-sourced Transformer-based methods using only Total-Text training set.
  • ...and 8 more figures