WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang, Jongheon Jeong, Seungryong Kim, Sangpil Kim

TL;DR

WaTeRFlow proposes a robust watermarking framework for image-to-video scenarios by embedding watermarks with a flow-aware encoder, training under a Flow-guided Unified Synthesis Engine (FUSE) that includes image-editing and fast video diffusion proxies, and stabilizing per-frame detection with a Temporal Consistency Loss and semantic preservation. The method achieves higher first-frame and per-frame bit accuracy across two representative I2V models (SVD-XT and CogVideoX) and maintains perceptual quality under diverse pre- and post-I2V distortions. Key contributions include end-to-end optimization with FUSE, flow-based frame alignment, and semantic-aware embedding, enabling practical provenance verification in real-world video generation contexts.

Abstract

Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V) generation, where per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
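To make the reported metric concrete, the following is a minimal sketch of how first-frame and per-frame bit accuracy could be computed from decoder outputs. It is not the authors' evaluation code: the function names, the zero decision threshold, and the toy logits are all assumptions introduced for illustration.

```python
from typing import List, Sequence


def bit_accuracy(decoded: Sequence[int], message: Sequence[int]) -> float:
    """Fraction of decoded bits that equal the embedded message bits."""
    if len(decoded) != len(message):
        raise ValueError("decoded and message must have the same length")
    if not message:
        raise ValueError("message must not be empty")
    matches = sum(int(d == m) for d, m in zip(decoded, message))
    return matches / len(message)


def per_frame_bit_accuracy(
    frame_logits: Sequence[Sequence[float]],
    message: Sequence[int],
    threshold: float = 0.0,  # assumed decision threshold on raw decoder logits
) -> List[float]:
    """Threshold each frame's logits into bits and score them against the message."""
    accuracies = []
    for logits in frame_logits:
        decoded = [1 if value > threshold else 0 for value in logits]
        accuracies.append(bit_accuracy(decoded, message))
    return accuracies


# Toy example: a 4-bit message and hypothetical decoder logits for three frames.
message = [1, 0, 1, 1]
frame_logits = [
    [2.1, -1.3, 0.7, 1.9],    # first frame
    [1.5, 0.4, 0.2, 1.1],     # a later frame with one flipped bit
    [-0.3, -0.8, -0.1, 0.6],  # a later frame with two flipped bits
]

accs = per_frame_bit_accuracy(frame_logits, message)
first_frame_acc = accs[0]
mean_acc = sum(accs) / len(accs)
print(first_frame_acc, accs, mean_acc)
```

In this toy setting, accuracy drops on later frames, which is the per-frame degradation pattern the abstract describes for watermarks carried through I2V generation.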

Paper Structure

This paper contains 13 sections, 15 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Application scenario of WaTeRFlow. Specifically, we consider watermark embedding for image copyright protection and verification from generated videos. Left: An owner, Alice, protects her picture by embedding a watermark. Middle: An unauthorized user, Bob, generates a video from Alice’s picture. Right: On Alice’s side, copyright is verified by recovering the watermark from the video frames using her watermark decoder, in order to identify the source used to generate the video and verify whether her picture was used.
  • Figure 2: The overview of WaTeRFlow. Left: The watermark encoder is optimized to embed the watermark while preserving quality in both pixel and latent space, and it is trained to keep the watermarked image semantically close to the original. Middle: Image editing and video generation are performed by an image editing proxy and a video diffusion proxy, respectively, and the generated frames are then warped to the first frame. Right: The decoder processes the images produced by FUSE to decode the embedded watermark and compute the training loss. Overall, these components enable watermark insertion and detection that are robust to image-to-video generation.
  • Figure 3: Qualitative results. Top: The original image and the watermarked images for each watermarking method. Middle: The 24th frames generated by SVD-XT, shown from left to right for the original image, our method, and the baselines. Bottom: The 24th frames generated by CogVideoX, in the same left-to-right order. Our method shows the highest bit accuracy for both video generation models in the given frames.
  • Figure 4: Per-frame bit accuracy and I2V robustness. Each plot visualizes bit accuracy on the even-numbered frames after image-to-video (I2V) generation. Across two representative I2V models, our method achieves the highest average bit accuracy compared to the baselines. It also shows the strongest robustness when I2V generation follows image editing [ultra_edit].
  • Figure 5: Training stability. Computed on frames from FUSE’s video generation branch, comparing with and without an optical-flow estimator. The estimator yields a more stable trajectory, indicating improved robustness to image-to-video generation.