GreatSplicing: A Semantically Rich Splicing Dataset
Jiaming Liang, Yuwan Xue, Haowei Liu, Zhenqi Dai, Yu Liao, Rui Wang, Weihao Jiang, Yaping Liu, Zhikun Chen, Guoxiao Liu, Bo Liu, Xiuli Bi
TL;DR
GreatSplicing addresses the limited semantic diversity of existing splicing forgery datasets with a manually constructed, large, high-resolution dataset derived from BossBase, containing 5,000 spliced images across 335 semantic categories. The authors use a controlled Photoshop-based pipeline that preserves genuine splicing traces and produces precise ground-truth masks, so that detectors learn splicing cues rather than semantic content. Evaluations across multiple baselines show fewer authentic images misidentified as spliced and improved cross-dataset generalization, particularly for segmentation-capable models such as U-Net and RRU-Net. The dataset and its standardized production protocol improve reproducibility and benchmarking for splicing forgery detection and offer a realistic, semantically rich resource for future research.
Abstract
In existing splicing forgery datasets, the insufficient semantic variety of spliced regions causes trained detection models to overfit semantic features rather than learn genuine splicing traces. Meanwhile, the lack of a reasonable benchmark dataset has led to inconsistent experimental settings across existing detection methods. To address these issues, we propose GreatSplicing, a manually created, large-scale, high-quality splicing dataset. GreatSplicing comprises 5,000 spliced images and covers spliced regions across 335 distinct semantic categories, enabling detection models to learn splicing traces more effectively. Empirical results show that detection models trained on GreatSplicing achieve lower misidentification rates and stronger cross-dataset generalization than models trained on existing datasets. GreatSplicing is now publicly available for research purposes at the following link.
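To make the two evaluation notions above concrete, the following is a minimal sketch of how a misidentification rate on authentic images and a pixel-level F1 score against a ground-truth mask might be computed. The definitions are assumptions for illustration: this section does not specify the exact metrics used, and the function names `misidentification_rate` and `pixel_f1`, the binary-mask representation, and the `threshold` parameter are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = pixel predicted/labelled as spliced


def misidentification_rate(pred_masks: List[Mask], threshold: int = 0) -> float:
    """Fraction of authentic images flagged as spliced.

    Assumption: an authentic image counts as misidentified when the detector
    marks more than `threshold` pixels as spliced.
    """
    if not pred_masks:
        raise ValueError("pred_masks must not be empty")
    flagged = sum(
        1 for mask in pred_masks if sum(sum(row) for row in mask) > threshold
    )
    return flagged / len(pred_masks)


def pixel_f1(pred: Mask, truth: Mask) -> float:
    """Pixel-level F1 between a predicted mask and a ground-truth mask."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    # Convention (assumed): two empty masks agree perfectly.
    return 2 * tp / denom if denom else 1.0


# Toy usage: predictions on two authentic images, one with a false alarm.
authentic_preds = [[[0, 0], [0, 0]], [[0, 1], [0, 0]]]
rate = misidentification_rate(authentic_preds)

# Toy usage: one spliced image with its ground-truth mask.
score = pixel_f1([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Under these assumed definitions, a model that has overfit to semantic content would tend to raise the first number on authentic images containing familiar objects, while the second number measures localization quality on spliced images.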
