Camera Model Identification with SPAIR-Swin and Entropy based Non-Homogeneous Patches

Protyay Dey, Rejoy Chakraborty, Abhilasha S. Jadhav, Kapil Rana, Gaurav Sharma, Puneet Goyal

TL;DR

The paper tackles source camera model identification (SCMI) by introducing SPAIR-Swin, an architecture that fuses a SPAIR block with the Swin Transformer to capture both global and local camera-specific artifacts. A novel entropy-based patch extraction strategy selects high-information regions likely to carry distinctive sensor noise and processing traces, improving discriminability. Empirical results on the Dresden, Vision, Forchheim, and Socrates datasets show state-of-the-art image-level and patch-level accuracies, with substantial gains over multiple baselines and evidence that high-entropy patches are particularly informative. The approach offers a practical path for robust SCMI in forensic contexts, and the authors state that code will be made available upon request.

Abstract

Source camera model identification (SCMI) plays a pivotal role in image forensics with applications including authenticity verification and copyright protection. For identifying the camera model used to capture a given image, we propose SPAIR-Swin, a novel model combining a modified spatial attention mechanism and inverted residual block (SPAIR) with a Swin Transformer. SPAIR-Swin effectively captures both global and local features, enabling robust identification of artifacts such as noise patterns that are particularly effective for SCMI. Additionally, unlike conventional methods focusing on homogeneous patches, we propose a patch selection strategy for SCMI that emphasizes high-entropy regions rich in patterns and textures. Extensive evaluations on four benchmark SCMI datasets demonstrate that SPAIR-Swin outperforms existing methods, achieving patch-level accuracies of 99.45%, 98.39%, 99.45%, and 97.46% and image-level accuracies of 99.87%, 99.32%, 100%, and 98.61% on the Dresden, Vision, Forchheim, and Socrates datasets, respectively. Our findings highlight that high-entropy patches, which contain high-frequency information such as edge sharpness, noise, and compression artifacts, are more favorable in improving SCMI accuracy. Code will be made available upon request.

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Proposed Pipeline for SPAIR-Swin Architecture. The process begins by extracting patches from the input image. These patches are then processed by the SPAIR block, which generates Feature-Enhanced Patches that capture prominent visual cues. The enhanced patches are subsequently fed into the Swin Transformer for the final classification.
  • Figure 2: Entropy-based patch extraction. The input images are first center-cropped, then processed to get non-overlapping patches of size $k \times k$. For each patch, entropy is computed based on its grayscale pixel intensity distribution. The patches are ranked in descending order of entropy and the top $\mathbf{P}$ patches are selected for the purpose of SCMI.
  • Figure 3: The schematic diagram of the SPAIR module. The proposed feature extractor incorporates an Inverted Residual Block with an expansion factor of $4$ and a modified spatial attention module. The Inverted Residual Block makes use of depthwise convolution coupled with elementwise additive skip connection, while the modified spatial attention module uses average pooling with an adaptive mechanism to improve feature representation. ReLU and batch normalization are added after convolutional and depthwise convolutional layers except for the last layer that employs a convolutional layer with a sigmoid function. s: stride, p: padding, f: number of filters, m: kernel size or mask size, e: expansion factor, $\mathbf{\otimes}$: elementwise multiplication, $\mathbf{\oplus}$: elementwise addition.
  • Figure 4: Swin Transformer Architecture. The components of the Swin Transformer, including the LayerNorm (LN) layer, MultiLayer Perceptron (MLP), and Multi-Head Self-Attention modules (W-MSA and SW-MSA), with regular and shifted windowing configurations, respectively.
  • Figure 5: Comparison of F1 Scores achieved by various methods on the four datasets. Across all datasets, the proposed method outperforms the others and achieves near-perfect F1 scores in most cases, indicating that it generalizes across datasets. Previous state-of-the-art methods perform inconsistently, with a few of them showing drastic reductions in their F1 scores.
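
The entropy-based patch extraction described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`patch_entropy`, `center_crop`, `top_entropy_patches`) are hypothetical, the input is assumed to be an 8-bit grayscale array already, entropy is assumed to be Shannon entropy of the 256-bin intensity histogram, and the center crop is assumed to be the largest centered region whose sides are multiples of $k$ (the caption does not state the crop size).

```python
import numpy as np


def patch_entropy(patch_gray: np.ndarray) -> float:
    """Shannon entropy (in bits) of an 8-bit grayscale patch's intensity histogram."""
    hist = np.bincount(patch_gray.ravel(), minlength=256).astype(np.float64)
    probs = hist / hist.sum()
    probs = probs[probs > 0]  # skip empty bins so log2 is well defined
    return float(-(probs * np.log2(probs)).sum())


def center_crop(image: np.ndarray, crop_h: int, crop_w: int) -> np.ndarray:
    """Return the centered crop_h x crop_w region of the image."""
    h, w = image.shape[:2]
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]


def top_entropy_patches(image_gray: np.ndarray, k: int, num_patches: int):
    """Rank non-overlapping k x k patches by entropy and keep the top ones.

    Returns a list of (entropy, patch) pairs in descending order of entropy.
    Assumption: the crop is the largest centered region divisible by k.
    """
    h, w = image_gray.shape
    cropped = center_crop(image_gray, (h // k) * k, (w // k) * k)

    scored = []
    for row in range(0, cropped.shape[0], k):
        for col in range(0, cropped.shape[1], k):
            patch = cropped[row:row + k, col:col + k]
            scored.append((patch_entropy(patch), row, col))

    # Descending entropy; the stable sort keeps raster order among ties.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(ent, cropped[r:r + k, c:c + k]) for ent, r, c in scored[:num_patches]]


if __name__ == "__main__":
    # Synthetic 64 x 64 image made of four 32 x 32 blocks of increasing texture.
    image = np.zeros((64, 64), dtype=np.uint8)            # top-left stays flat
    image[:32, 32:] = np.tile([0, 255], (32, 16))         # two intensity levels
    image[32:, :32] = (np.arange(1024) % 256).reshape(32, 32)  # all 256 levels
    image[32:, 32:] = np.tile([0, 64, 128, 192], (32, 8))  # four intensity levels

    for entropy, patch in top_entropy_patches(image, k=32, num_patches=2):
        print(f"entropy={entropy:.2f} bits, patch shape={patch.shape}")
```

In this toy input the flat block has zero entropy and is ranked last, which mirrors the caption's point that selection favors textured, non-homogeneous regions over homogeneous ones. The values of `k` and `num_patches` here are placeholders; the caption names them $k$ and $\mathbf{P}$ without giving values.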