WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, Yu-Gang Jiang

Abstract

Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.

Paper Structure (38 sections, 7 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Left: WeEdit achieves precise manipulation of textual content within images across diverse editing operations (edited regions are highlighted with blue bounding boxes). Right: WeEdit achieves the best performance among all open-source models on both bilingual and multilingual benchmarks, surpassing most proprietary models and ranking second only to Nano Banana Pro.
  • Figure 2: Overview of the glyph-guided supervised fine-tuning stage. A VLM first predicts the content and layout of the target text to render a glyph image. The original image, instruction, and glyph image are then jointly processed by the MM-DiT block to generate the target image.
  • Figure 3: Overview of the RL stage. The model generates multiple candidate images, which are evaluated by four separate reward models targeting four dimensions. Each reward model leverages a Vision-Language Model to produce logit distributions over discrete scores, which are then converted to continuous expected values.
  • Figure 4: Overview of our data construction pipelines. Top: the structured pipeline converts a source image to HTML, extracts and edits text content via a VLM, and renders both source and target images through a headless browser, yielding pixel-perfect editing pairs. Bottom: the unstructured pipeline uses a VLM to propose editing instructions, executes edits with a generative model, and iteratively verifies quality until all acceptance criteria are met.
  • Figure 5: Statistics of WeEdit Dataset: (a) Distribution over the seven editing operation types. (b) Language distribution across 15 supported languages. (c) Distribution of the number of edited regions per sample. (d) Distribution of the total edited text length (in characters) per sample.
  • ...and 9 more figures
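
The Figure 3 caption says each reward model turns a VLM's logit distribution over discrete scores into a continuous expected value. A minimal sketch of that conversion follows, assuming a softmax over the logits of the score tokens and a 1 to 5 scale; the function name `expected_score`, the scale, and the example logits are illustrative assumptions, not details taken from the paper.

```python
import math


def expected_score(logits, scores):
    """Return the expected score under a softmax over per-score logits.

    logits: one raw logit per discrete score token (hypothetical VLM output).
    scores: the numeric value of each score token, e.g. [1, 2, 3, 4, 5].
    """
    if len(logits) != len(scores) or not logits:
        raise ValueError("logits and scores must be non-empty and equal in length")

    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    weights = [math.exp(x - max_logit) for x in logits]
    total = sum(weights)

    # Expectation: sum of probability * score value.
    return sum((w / total) * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    # Assumed 1-5 scale with made-up logits that favor the score 4.
    reward = expected_score([-1.0, 0.0, 1.0, 3.0, 2.0], [1, 2, 3, 4, 5])
    print(f"continuous reward: {reward:.3f}")
```

Compared with taking only the most likely score, the expectation keeps small differences between candidates that would otherwise round to the same integer, which is one plausible reason to prefer a continuous reward signal in an RL stage.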