WAS: Dataset and Methods for Artistic Text Segmentation

Xudong Xie, Yuzhe Li, Yang Liu, Zhifei Zhang, Zhaowen Wang, Wei Xiong, Xiang Bai

TL;DR

This work tackles artistic text segmentation, a challenging task due to highly variable local stroke shapes and complex global topology. It introduces WAS-R (real) and WAS-S (synthetic) datasets and presents WASNet, a Mask2Former–based framework augmented with Layer-wise Momentum Query (LMQ) and a skeleton-assisted head to capture both local and global structure. A synthetic data pipeline leveraging the Monkey multimodal model and ControlNet enables large-scale mask-aligned image generation, delivering significant performance gains and strong cross-dataset generalization without extensive fine-tuning. The approach establishes a new benchmark and an adaptable experimental paradigm for artistic text segmentation with practical implications for text-related generation and editing tasks.

Abstract

Accurate text segmentation results are crucial for text-related generative tasks, such as text image generation, text editing, text removal, and text style transfer. Recent scene text segmentation methods have made significant progress in segmenting regular text, but they perform poorly on artistic text. This paper therefore focuses on the more challenging task of artistic text segmentation and constructs a real artistic text segmentation dataset. One challenge of the task is that the local stroke shapes of artistic text are diverse, complex, and highly variable. We propose a decoder with a layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes. Another challenge is the complexity of the global topological structure. We further design a skeleton-assisted head to guide the model to focus on the global structure. Additionally, to enhance the generalization performance of the text segmentation model, we propose a training data synthesis strategy based on a large multi-modal model and a diffusion model. Experimental results show that our proposed method and synthetic dataset significantly enhance the performance of artistic text segmentation and achieve state-of-the-art results on other public datasets.
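
The abstract names a layer-wise momentum query but does not give its update rule. The following is a minimal sketch of one plausible reading, assuming that the query passed to each decoder layer is an exponential-moving-average blend of the previous query and that layer's refined output. The function names, the `momentum` value, and the toy layers are all hypothetical and are not taken from the paper.

```python
import numpy as np


def momentum_query_update(prev_query, layer_output, momentum=0.9):
    """Blend the previous query with the current layer's output.

    ASSUMPTION: a plain exponential moving average; the paper's actual
    LMQ formulation may differ.
    """
    return momentum * prev_query + (1.0 - momentum) * layer_output


def run_decoder(initial_query, layers, momentum=0.9):
    """Apply each (hypothetical) decoder layer, carrying a momentum query."""
    query = initial_query
    for layer in layers:
        refined = layer(query)  # stand-in for a Transformer decoder layer
        query = momentum_query_update(query, refined, momentum)
    return query


# Toy setup: 4 queries of dimension 8, and three "layers" that each add 1.0.
rng = np.random.default_rng(0)
initial_query = rng.normal(size=(4, 8))
toy_layers = [lambda q: q + 1.0 for _ in range(3)]

final_query = run_decoder(initial_query, toy_layers, momentum=0.9)
```

With a high momentum, each layer can shift the query only slightly, which is one way a decoder could avoid discarding information about unusual stroke regions picked up in earlier layers.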

Paper Structure

This paper contains 28 sections, 4 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Examples of images and annotations from the proposed WAS-R dataset.
  • Figure 2: (a) Training pipeline of ControlNet. (b) WAS-S data generation pipeline.
  • Figure 3: The generated <mask, prompt, image> triplet. The left column shows the generated masks. The middle column shows the prompts generated by GPT-4, which imitate the style of the prompts in the training set. The right column shows the final generated images.
  • Figure 4: Top: The overall architecture of our WASNet. Bottom: The Transformer decoder with layer-wise momentum query (LMQ).
  • Figure 5: Qualitative comparison between the baseline model Mask2Former [cheng2021per] and our WASNet. The two innovations of our method each alleviate one of the two main problems of artistic text segmentation.
  • ...and 4 more figures