Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs

Jinmin Li; Kuofeng Gao; Yang Bai; Jingyun Zhang; Shu-Tao Xia

TL;DR

The paper addresses the risk of unauthorized video annotation by video-based LLMs and introduces Flow-based Video Watermarking, which applies an imperceptible perturbation $\Delta$ to a sparse set of frames selected via a flow-based mask $\mathbf{M}_f$. Multi-modal losses guide the perturbation so that the viewing experience is preserved while LLM comprehension degrades. The method jointly optimizes video-feature consistency $\ell_{video}$ and LLM hidden-state consistency $\ell_{LLM}$ under a constraint on $\Delta$, using a flow-aware objective that emphasizes key frames. Experiments show that watermarking fewer than 20% of frames significantly reduces CLIP scores, BLEU, ROUGE, and CIDEr metrics, and GPT-3.5/4 accuracies on ActivityNet-200 and MSVD-QA, outperforming baseline perturbations and transferring to black-box settings. This work provides a practical defense for video data privacy in multi-modal AI, with implications for safeguarding content against misuse by video-based LLMs.
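The summary above can be illustrated with a toy sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: frame differencing stands in for real optical flow, a fixed linear projection `proj` stands in for the video encoder, the LLM hidden-state term is omitted, the loss pushes watermarked features away from clean ones, and the constraint on the perturbation is taken to be an $\ell_\infty$ bound `eps`. The function names `select_key_frames` and `watermark` are hypothetical.

```python
import numpy as np


def select_key_frames(video, ratio=0.2):
    """Return a boolean mask marking the frames with the largest motion.

    video: float array of shape (T, H, W, C) with values in [0, 1].
    Motion is approximated by the mean absolute difference between
    consecutive frames, a stand-in for optical flow (assumption).
    """
    diffs = np.abs(np.diff(video, axis=0)).mean(axis=(1, 2, 3))  # (T-1,)
    motion = np.concatenate([[diffs[0]], diffs])  # frame 0 reuses the first diff
    k = max(1, int(ratio * len(video)))
    mask = np.zeros(len(video), dtype=bool)
    mask[np.argsort(motion)[-k:]] = True
    return mask


def watermark(video, proj, mask, eps=8 / 255, step=1 / 255, iters=10):
    """Masked sign-gradient ascent on the squared feature distance.

    proj: (H*W*C, d) matrix acting as a toy frame encoder (assumption).
    Only frames where mask is True are perturbed; the perturbation is
    kept inside an L-infinity ball of radius eps (assumed constraint).
    """
    num_frames = video.shape[0]
    flat = video.reshape(num_frames, -1)
    clean_feat = flat @ proj
    frame_mask = mask[:, None]
    # Random start: the gradient of the squared distance is zero at delta = 0.
    rng = np.random.default_rng(0)
    delta = rng.uniform(-eps, eps, size=flat.shape) * frame_mask
    for _ in range(iters):
        feat = np.clip(flat + delta, 0.0, 1.0) @ proj
        # Gradient of ||feat - clean_feat||^2 w.r.t. delta (pixel clipping ignored).
        grad = 2.0 * (feat - clean_feat) @ proj.T
        delta = np.clip(delta + step * np.sign(grad), -eps, eps) * frame_mask
    return np.clip(flat + delta, 0.0, 1.0).reshape(video.shape)
```

Calling `watermark(video, proj, select_key_frames(video))` returns a video in which only the masked frames differ from the input, each pixel by at most `eps`. A faithful version would replace `proj` with the target model's visual encoder and add the hidden-state loss described in the paper.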

Abstract

The advent of video-based Large Language Models (LLMs) has significantly enhanced video understanding. However, it has also raised safety concerns regarding data protection, as videos can now be annotated more easily, even without authorization. This paper introduces Video Watermarking, a novel technique that protects videos from unauthorized annotation by such video-based LLMs, in particular annotation of the video content and description in response to specific queries. By imperceptibly embedding watermarks into key video frames with multi-modal flow-based losses, our method preserves the viewing experience while preventing misuse by video-based LLMs. Extensive experiments show that Video Watermarking significantly reduces the comprehensibility of videos for various video-based LLMs, demonstrating both stealth and robustness. In essence, our method provides a solution for securing video content, ensuring its integrity and confidentiality in the face of evolving video-based LLM technologies.

Paper Structure (19 sections, 6 equations, 7 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Video-based Large Language Models
Adversarial Attack
Methodology
Threat model
Preliminary: the Pipeline of Video-based LLMs
Problem Formulation
Optimization Objective
Experiments
Implementation Details
Main Results
Discussions
Ablation Studies
Limitation
...and 4 more sections

Figures (7)

Figure 1: Schematics of our Video Watermarking.
Figure 2: Watermarking videos generated for Video-ChatGPT.
Figure 3: Relationship between optical flow and key frames. 'Clip Score of Adjacent Frames' measures the similarity between the current frame and its adjacent frames; the smaller the score, the more the current frame differs from its neighbors. 'Clip Score of Answer and Current Frame' measures the similarity between the current frame and the answer to the user's input question; the larger the score, the more information about the answer the current frame contains. The frames selected by flow-based masks in our Video Watermarking are key frames in the video.
Figure 4: Transfer-based black-box watermarking on VideoChat.
Figure 5: Relationship between optical flow and key frames. 'Clip Score of Adjacent Frames' measures the similarity between the current frame and its adjacent frames; the smaller the score, the more the current frame differs from its neighbors. The frames selected by flow-based masks in our Video Watermarking are key frames in the video.
...and 2 more figures
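
The 'Clip Score of Adjacent Frames' quantity in the captions of Figures 3 and 5 can be sketched as follows. This is only an illustration under stated assumptions: per-frame embeddings are taken as given (for example from a CLIP image encoder, which is not called here), similarity is cosine similarity, and each frame's score is the mean over its one or two temporal neighbors. The function name `adjacent_similarity` is hypothetical.

```python
import numpy as np


def adjacent_similarity(emb):
    """Mean cosine similarity between each frame and its temporal neighbors.

    emb: (T, d) array of per-frame embeddings with T >= 2 (assumed precomputed).
    Returns a (T,) array; lower values mark frames that differ more from
    their neighbors.
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = (unit[:-1] * unit[1:]).sum(axis=1)  # cosine between frame t and t+1
    score = np.empty(len(emb))
    score[0] = sim[0]    # first frame has only a right neighbor
    score[-1] = sim[-1]  # last frame has only a left neighbor
    score[1:-1] = 0.5 * (sim[:-1] + sim[1:])
    return score
```

Frames with the lowest scores are the ones the captions describe as most different from their surroundings.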
