GreatSplicing: A Semantically Rich Splicing Dataset

Jiaming Liang, Yuwan Xue, Haowei Liu, Zhenqi Dai, Yu Liao, Rui Wang, Weihao Jiang, Yaping Liu, Zhikun Chen, Guoxiao Liu, Bo Liu, Xiuli Bi

TL;DR

GreatSplicing addresses the limited semantic diversity of existing splicing forgery datasets by manually constructing a large, high-resolution dataset derived from BossBase, containing 5,000 spliced images whose spliced regions span 335 semantic categories. The authors use a controlled Photoshop-based pipeline to preserve genuine splicing traces and produce precise ground-truth masks, so that detectors learn splicing cues rather than semantic content. Evaluations across multiple baselines show fewer authentic images misidentified as spliced and better cross-dataset generalization, particularly for segmentation-capable models such as U-Net and RRU-Net. The dataset and its standardized production protocol improve reproducibility and benchmarking for splicing forgery detection and offer a realistic, semantically rich resource for future research.

Abstract

In existing splicing forgery datasets, the insufficient semantic variety of spliced regions causes trained detection models to overfit semantic features rather than learn genuine splicing traces. Meanwhile, the lack of a reasonable benchmark dataset has led to inconsistent experimental settings across existing detection methods. To address these issues, we propose GreatSplicing, a manually created, large-scale, high-quality splicing dataset. GreatSplicing comprises 5,000 spliced images and covers spliced regions across 335 distinct semantic categories, enabling detection models to learn splicing traces more effectively. Empirical results show that detection models trained on GreatSplicing achieve low misidentification rates and generalize better across datasets than models trained on existing datasets. GreatSplicing is now publicly available for research purposes at the following link.
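
The misidentification rate mentioned above can be made concrete with a small sketch. It assumes that an authentic image counts as misidentified when the detector's predicted binary mask marks at least `min_pixels` pixels as spliced; this rule, the threshold, and the function names are illustrative assumptions, not the paper's stated definition.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask: 1 = predicted spliced pixel


def is_flagged(mask: Mask, min_pixels: int = 1) -> bool:
    """Return True if the mask marks at least `min_pixels` pixels as spliced."""
    positives = sum(1 for row in mask for pixel in row if pixel)
    return positives >= min_pixels


def misidentification_rate(masks: List[Mask], min_pixels: int = 1) -> float:
    """Fraction of authentic images that the detector flags as spliced.

    Every mask in `masks` is assumed to be a prediction on an authentic
    image, so any flagged image is a false alarm.
    """
    if not masks:
        raise ValueError("need at least one authentic image")
    flagged = sum(is_flagged(mask, min_pixels) for mask in masks)
    return flagged / len(masks)


# Hypothetical predictions on four authentic 2x2 images.
predictions = [
    [[0, 0], [0, 0]],
    [[0, 1], [0, 0]],  # one stray positive pixel: counted as a false alarm
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
]
rate = misidentification_rate(predictions)
```

Raising `min_pixels` makes the rule tolerant of isolated stray pixels, which changes the reported rate; the value used in any comparison should be stated alongside the result.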

Paper Structure (13 sections, 4 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: The distinctions between GreatSplicing and existing splicing datasets. The spliced regions in GreatSplicing exhibit a wider variety of semantic categories, which enhances the network's learning of splicing traces.
  • Figure 2: Examples of limitations in existing splicing datasets.
  • Figure 3: GreatSplicing comprises 335 distinct semantic categories for its spliced regions. In the distribution diagram, the horizontal axis is the index of each semantic category and the vertical axis is the number of spliced images in that category, with categories arranged in descending order of image count.
  • Figure 4: Selected spliced images from GreatSplicing. GreatSplicing exhibits significant advantages, including semantic richness, high verisimilitude, and high resolution.
  • Figure 5: Results of authentic image misidentification. Twelve authentic color images were sourced from CASIA, FantasticReality, and BossBase and tested with models trained on the CASIA, FantasticReality, and GreatSplicing datasets, respectively; the figure shows where each model misidentifies authentic content as spliced.
  • ...and 1 more figure
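
The category distribution described for Figure 3 can be reproduced from per-image labels with a short sketch. It assumes each spliced image carries exactly one semantic category label for its spliced region; the label values and the function name `category_distribution` are hypothetical, and ties are broken alphabetically only to keep the ordering deterministic.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def category_distribution(labels: Iterable[str]) -> List[Tuple[int, str, int]]:
    """Rank categories by image count in descending order.

    Returns (category_index, category_name, image_count) tuples, where
    category_index starts at 1 and follows the descending-count order,
    matching the horizontal axis described for Figure 3.
    """
    counts = Counter(labels)
    # Sort by count (largest first), then by name to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(index, name, count)
            for index, (name, count) in enumerate(ranked, start=1)]


# Hypothetical labels for six spliced images.
labels = ["dog", "car", "dog", "bird", "car", "dog"]
distribution = category_distribution(labels)
for index, name, count in distribution:
    print(index, name, count)
```

Plotting the third element of each tuple against the first gives the descending curve; a long, flat tail would indicate many categories with only a few images each.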