Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis

Zeqin Yu, Haotao Xie, Jian Zhang, Jiangqun Ni, Wenkan Su, Jiwu Huang

TL;DR

The paper addresses the poor real-world generalization of text image forgery localization (T-IFL) models by modeling the invisible, high-dimensional tampering parameters that underlie real-world forgeries. It introduces Fourier Series-based Tampering Synthesis (FSTS), a hierarchical, interpretable framework that collects 16,750 real-world tampering traces from 67 experts, identifies recurring individual and population patterns, and represents tampering distributions as basis configurations with learned weights. By sampling basis configurations according to these learned weights, FSTS synthesizes diverse and realistic tampered images that better reflect real forgery traces, improving cross-domain generalization on real-world datasets. Extensive experiments across four evaluation protocols show that FSTS-trained models consistently outperform baselines trained on conventional synthetic data, highlighting the practical value of incorporating real-world tampering distributions into synthetic data generation. This approach offers a principled path toward robust, scalable T-IFL systems capable of handling unseen tampering scenarios.

Abstract

Existing Text Image Forgery Localization (T-IFL) methods often suffer from poor generalization due to the limited scale of real-world datasets and the distribution gap caused by synthetic data that fails to capture the complexity of real-world tampering. To tackle this issue, we propose Fourier Series-based Tampering Synthesis (FSTS), a structured and interpretable framework for synthesizing tampered text images. FSTS first collects 16,750 real-world tampering instances from five representative tampering types, using a structured pipeline that records human-performed editing traces via multi-format logs (e.g., video, PSD, and editing logs). By analyzing these collected parameters and identifying recurring behavioral patterns at both individual and population levels, we formulate a hierarchical modeling framework. Specifically, each individual tampering parameter is represented as a compact combination of basis operation-parameter configurations, while the population-level distribution is constructed by aggregating these behaviors. Since this formulation draws inspiration from the Fourier series, it enables an interpretable approximation using basis functions and their learned weights. By sampling from this modeled distribution, FSTS synthesizes diverse and realistic training data that better reflect real-world forgery traces. Extensive experiments across four evaluation protocols demonstrate that models trained with FSTS data achieve significantly improved generalization on real-world datasets. The dataset is available on the project page: https://github.com/ZeqinYu/FSTS.
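The hierarchical idea in the abstract (individual behavior as a weighted combination of basis configurations, aggregated into a population-level distribution that is then sampled) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the basis configuration names, the per-tamperer weights, and the use of simple averaging as the aggregation step are hypothetical and are not taken from the paper or its released code.

```python
import random

# Hypothetical basis operation-parameter configurations (illustrative only).
BASIS = ["blur(r=1)", "blur(r=2)", "shadow(on)", "none"]

# Hypothetical per-individual weights over the basis; each row sums to 1.
INDIVIDUAL_WEIGHTS = {
    "tamperer_a": [0.6, 0.1, 0.2, 0.1],
    "tamperer_b": [0.1, 0.5, 0.1, 0.3],
    "tamperer_c": [0.2, 0.2, 0.5, 0.1],
}


def population_weights(individual_weights):
    """Aggregate individual weights into one population-level weight vector.

    Plain averaging is an assumption of this sketch, not the paper's rule.
    """
    rows = list(individual_weights.values())
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


def sample_configs(weights, k, seed=0):
    """Draw k basis configurations with probability proportional to weights."""
    rng = random.Random(seed)
    return rng.choices(BASIS, weights=weights, k=k)


pop = population_weights(INDIVIDUAL_WEIGHTS)
configs = sample_configs(pop, k=5)
```

In a synthesis pipeline, each sampled configuration would then parameterize one tampering operation applied to a clean text image; that step is omitted here because it depends on the image-editing backend.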

Paper Structure

This paper contains 34 sections, 8 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Visible vs. Invisible distributions in synthetic tampered text image datasets. Existing datasets mainly focus on visible attributes (a–d), while our FSTS strategy models invisible tampering parameters (e–g) derived from real-world tampering scenarios.
  • Figure 2: Parameter usage frequencies in "Replacement" tampering samples across three tamperers and the overall average.
  • Figure 3: Overview of the proposed FSTS framework. (a) Inspiration: A rectangular signal $s(x)$ is approximated by a weighted sum of sinusoidal basis functions, i.e., $s_{N}(x) = \sum_{k=1}^N \frac{4}{(2k-1)\pi}\sin((2k-1)x)$, where $\sin\!\bigl((2k-1)x\bigr)$ is the $k$-th basis function and $\frac{4}{(2k-1)\pi}$ is its weight. Larger $N$ yields higher-fidelity reconstruction, illustrating the idea of decomposition and recombination over a quasi-periodic domain. (b) Modeling: Each individual distribution $P_S^{(i)}(t)$ is modeled as a weighted combination of basis tampering configurations (individual-level reconstruction), and their aggregation $P_S(t)$ approximates $P_R(t)$ (population-level reconstruction). (c) Generation: Based on the learned basis functions and weights from (b), parameter configurations are sampled and applied to text images, yielding synthetic tampered images that more accurately reflect real-world forgery traces.
  • Figure 4: Synthetic Tampered Text Image Generation Pipeline with Parameters Modeled by FSTS. (I) The overall pipeline takes a target image and synthesizes a tampered version along with its corresponding ground-truth mask, using tampering types, parameter configurations, and frequency weights modeled by our proposed FSTS framework. (II) This panel zooms into the tampering step in (I), detailing the main and post-processing operations for five representative tampering types. We use consistent color coding to distinguish different tampering types in both the ground-truth mask (II)(d) and the operation detail panels (II)(1-5): (1) Copy-move (copies and moves text within the same image), (2) Splicing (pastes text from a source image into a target image), (3) Removal (erases text, followed by in-painting), (4) Insertion (inserts forged text into blank regions), (5) Replacement (generates forged text to replace original text).
  • Figure 5: Comparison of tampering operation diversity between existing synthetic datasets and real-world forgery samples. (1)–(4) show examples from four synthetic datasets: PS-scripted [zhuang2021image], DocTamper [qu2023towards], TIC13 [wang2022detecting], and T-SROIE [wang2022tampered]. In each sample, the forged region is highlighted in red. PS-scripted uses real-world tampering parameters but randomly assigns tampering targets, lacking representative coverage of tampering types. The others are generated using deep generative methods, which often apply similar operations and parameters across samples, reflecting limited diversity in invisible distributions. In contrast, (5)–(8) visualize four replacement samples collected from real-world tampered data. Each case reflects a distinct combination of tampering operation-parameters (e.g., region sampling, insertion, shadow, blur), illustrating the diversity and complexity inherent in real-world tampering behaviors. This comparison highlights the importance of modeling invisible parameter distributions to improve the diversity and realism of synthetic data.
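
The Fourier-series inspiration described in the Figure 3(a) caption can be checked numerically. The following is a minimal Python sketch, assuming the target rectangular signal is the unit square wave that equals 1 on $(0, \pi)$; the function name is hypothetical and the snippet is not part of the paper's code. It evaluates the partial sum $s_N(x)$ at $x = \pi/2$ and shows the reconstruction error shrinking as $N$ grows.

```python
import math


def square_wave_partial_sum(x: float, n_terms: int) -> float:
    """Evaluate s_N(x) = sum_{k=1}^{N} 4 / ((2k-1) * pi) * sin((2k-1) * x)."""
    total = 0.0
    for k in range(1, n_terms + 1):
        m = 2 * k - 1                      # odd harmonic index
        weight = 4.0 / (m * math.pi)       # weight of the k-th basis function
        total += weight * math.sin(m * x)  # basis function sin((2k-1)x)
    return total


# Reconstruction error at x = pi/2, where the assumed square wave equals 1.
errors = {
    n: abs(square_wave_partial_sum(math.pi / 2, n) - 1.0)
    for n in (1, 5, 50)
}
```

The entries of `errors` decrease with `n`, which mirrors the caption's point that a larger $N$ gives a higher-fidelity reconstruction.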