Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness

Yufeng Wu, Xin Liao, Baowei Wang, Han Fang, Xiaoshuai Wu, Mingyue Chen, Guiling Wang

TL;DR

This work addresses information leakage through unauthorized screen capture by strengthening the robustness of Screen-Camera (SC) resistant watermarking. It introduces Simulation-to-Real (S2R), an unsupervised noise layer that bridges simulated and real SC noise in two stages: a mathematical model $T$ first maps sharp images into a known ("certain") noise domain, and an unpaired image-to-image map $G$ then aligns that noise with the real domain, yielding $y^u = G(T(x^s))$ and $F_\mathcal{U}(\cdot) = T * G$. A theoretical feasibility proof shows that the complex SC noise distribution can be decomposed into a multiplicative/additive form and approximated by learned bias terms $k_\delta, n_\delta$, which simplifies the distribution alignment task. The framework combines a differentiable noise model with adversarial and perceptual losses to train $G$, so the noise is refined while image content is preserved, without paired data. Experiments show that S2R outperforms state-of-the-art methods in watermark robustness and image quality across diverse devices, distances, and viewpoints, and that it integrates in a plug-and-play manner with different noise models and resolutions. The approach offers a practical path toward real-world SC watermarking protection with reduced data requirements and improved generalization.
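The two-stage composition $y^u = G(T(x^s))$ can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: `simulate_sc_noise` stands in for $T$ as a single gain plus Gaussian noise, and `refine_to_real` stands in for $G$ as a fixed scalar correction using hypothetical values for $k_\delta$ and $n_\delta$. In S2R itself, $G$ is a trained image-to-image network, and the exact way the bias terms enter is not specified in this summary.

```python
import numpy as np

def simulate_sc_noise(x, k=0.8, n_sigma=0.02, rng=None):
    """Stand-in for T: multiplicative gain k plus additive Gaussian noise.

    The gain/noise form and the default values are illustrative assumptions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = rng.normal(0.0, n_sigma, size=x.shape)
    return np.clip(k * x + n, 0.0, 1.0)

def refine_to_real(y_c, k_delta=0.05, n_delta=0.01):
    """Stand-in for G: a fixed multiplicative/additive correction.

    k_delta and n_delta are hypothetical constants here; the real G is a
    learned network, not a scalar rule.
    """
    return np.clip((1.0 + k_delta) * y_c + n_delta, 0.0, 1.0)

def s2r_noise_layer(x):
    """Compose the two stages: first T, then G."""
    y_c = simulate_sc_noise(x)   # sharp image -> simulated ("certain") domain
    return refine_to_real(y_c)   # simulated domain -> approximated real domain

# A flat gray 4x4 RGB image with values in [0, 1].
x_s = np.full((4, 4, 3), 0.5)
y_out = s2r_noise_layer(x_s)
print(y_out.shape, float(y_out.min()), float(y_out.max()))
```

The point of the sketch is the ordering: the hand-written model does the coarse work, and the second stage only has to correct the residual gap, which is the argument the feasibility proof is said to formalize.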

Abstract

Unauthorized screen capturing and dissemination pose severe security threats such as data leakage and information theft. Several studies propose robust watermarking methods to track the copyright of Screen-Camera (SC) images, facilitating post-hoc certification against infringement. These techniques typically employ heuristic mathematical modeling or supervised neural network fitting as the noise layer to enhance watermarking robustness against SC. However, neither strategy can fundamentally achieve an effective approximation of SC noise. Mathematical simulation suffers from biased approximations due to the incomplete decomposition of the noise and the absence of interdependence among the noise components. Supervised networks require paired data to train the noise-fitting model, and it is difficult for the model to learn all the features of the noise. To address these issues, we propose Simulation-to-Real (S2R). Specifically, an unsupervised noise layer employs unpaired data to learn the discrepancy between the modeled simulated noise distribution and the real-world SC noise distribution, rather than directly learning the mapping from sharp images to real-world images. Learning this transformation from simulation to reality is inherently simpler, as it primarily involves bridging the gap in noise distributions instead of the complex task of reconstructing fine-grained image details. Extensive experimental results validate the efficacy of the proposed method, demonstrating superior watermark robustness and generalization compared to state-of-the-art methods.

Paper Structure

This paper contains 44 sections, 18 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of traditional noise approximation strategies and the proposed Simulation-to-Real (S2R). (a) Mathematical modeling-based noise approximation. (b) Supervised neural fitting via paired images. (c) Our method: sharp images are first transformed into a certain (known) noise domain and then mapped, without supervision, to the unknown real domain, achieving a more realistic noise approximation.
  • Figure 2: Overview of the proposed S2R. In the training phase, given a set of sharp images $x^s$ and real SC images $y^u$, the sharp images are first transformed into images with a certain simulation noise distribution $y^c$ using a pre-defined mathematical modeling transformation $T$. Through unsupervised training, the Image-to-Image Network $G$ gradually adjusts $y^c$ to match the distribution of $y^u$, ultimately outputting the approximate images $y_{\text{out}}^u$. In the validation phase, given sharp images $x^s$, after passing through the transformation $T$ and the fixed-weight network $G$, the outputs are $y_{\text{out}}^u$.
  • Figure 3: Approximation results of different methods converting sharp images into noise images: (a) sharp images; (b) real SC images; (c) StegaStamp [tancik2020stegastamp]; (d) PIMoG [fang2022pimog]; (e) SSDS [li2024screen]; (f) our S2R.
  • Figure 4: Visual quality and robustness comparison of watermarking methods: (a) original; (b) StegaStamp [tancik2020stegastamp]; (c) PIMoG [fang2022pimog]; (d) SSDS [li2024screen]; (e) proposed S2R (trained on SIM+LEA).
  • Figure 5: The overall framework of the modified MIMO-UNet. It consists of an encoder that receives downsampled noise images at different scales and a Gaussian noise map sampled from a random normal distribution as input. The Shallow Convolutional Module (SCM) initially extracts features, followed by an Encoder Block (EB) containing a Feature Attention Module (FAM) that progressively extracts multi-level features. Between the encoder and decoder, an Asymmetric Feature Fusion (AFF) module is introduced to effectively fuse features at different scales. During the decoding phase, the network uses a single decoder to generate multiple deblurred images at different scales via the Decoder Block (DB). On the right side of the figure, from top to bottom, are the structures of the SCM, Feature Attention Module and AFF.
  • ...and 4 more figures
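
The Figure 5 caption says the encoder receives the noise image at several scales together with a Gaussian noise map. A minimal sketch of how such inputs might be prepared is shown below. The number of scales (three), the 2x2 average pooling used for downsampling, and the single-channel noise map are all assumptions for illustration; the caption does not state them, and the helper names are hypothetical.

```python
import numpy as np

def downsample2x(img):
    """Halve height and width by averaging each 2x2 block (H, W must be even)."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def build_encoder_inputs(y_c, num_scales=3, rng=None):
    """Return a list of progressively downsampled images and a Gaussian noise map.

    num_scales=3 and the one-channel noise map are illustrative assumptions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    scales = [y_c]
    for _ in range(num_scales - 1):
        scales.append(downsample2x(scales[-1]))
    # Noise map sampled from a standard normal, at full input resolution.
    noise_map = rng.standard_normal(y_c.shape[:2] + (1,))
    return scales, noise_map

# A random 64x64 RGB image standing in for a simulated noise image y^c.
y_c = np.random.default_rng(1).random((64, 64, 3))
scales, noise_map = build_encoder_inputs(y_c)
print([s.shape for s in scales], noise_map.shape)
```

Feeding a random map alongside the image is one common way to let a generator produce varied outputs for the same input; whether the map is concatenated channel-wise or injected elsewhere is not stated in the caption.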