Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations

Haitong Liu, Kuofeng Gao, Yang Bai, Jinmin Li, Jinxiao Shan, Tao Dai, Shu-Tao Xia

TL;DR

The paper addresses the privacy risk of unauthorized video annotations by video-based LLMs and proposes two imperceptible watermark families, Ramblings and Mutes, to disrupt downstream information leakage. Ramblings induce completely incorrect captions via feature- and logit-level perturbations, while Mutes bias EOS probabilities to produce shorter or NULL captions. Across three datasets and three models, the methods significantly degrade annotation quality and downstream text-to-video performance, demonstrating robust, transferable protection. This work provides a practical defensive paradigm for safeguarding personal video content against automated analysis and leakage, with broader implications for data privacy and model reuse.

Abstract

Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. Concretely, Ramblings aim to mislead video-based LLMs into generating inaccurate captions for the videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. Mutes, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content. Our code is available at https://github.com/ttthhl/Protecting_Your_Video_Content.
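As a rough illustration of the idea behind Mutes (raising the probability of the EOS token under a small, bounded perturbation), the sketch below uses a toy linear-softmax "model" over a three-token vocabulary. Everything in it is an assumption made for illustration: the weight matrix `W`, the feature vector `x`, the function names, the L-infinity budget `eps`, and the signed-gradient update are hypothetical stand-ins, not the authors' implementation, which operates on video frames and real video-based LLMs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(weights, x):
    """Toy model: one weight row per vocabulary token, logits = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def eos_log_prob_grad(weights, x, eos_id):
    """Gradient of log p(EOS | x) w.r.t. x for the linear-softmax toy model.

    For logits = W x this is W[eos] - sum_k p_k * W[k].
    """
    probs = softmax(logits_of(weights, x))
    return [
        weights[eos_id][j] - sum(p * row[j] for p, row in zip(probs, weights))
        for j in range(len(x))
    ]

def mute_perturb(weights, x, eos_id, eps=0.1, step=0.02, iters=20):
    """Signed gradient ascent on log p(EOS), clipped to |delta_j| <= eps.

    The optimizer and budget are assumptions, not taken from the paper.
    """
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + d for xi, d in zip(x, delta)]
        grad = eos_log_prob_grad(weights, x_adv, eos_id)
        # Step in the sign of the gradient, then project back into the eps-ball.
        delta = [
            max(-eps, min(eps, d + step * ((g > 0) - (g < 0))))
            for d, g in zip(delta, grad)
        ]
    return delta

# Hypothetical toy data: 3 tokens (index 2 plays the role of EOS), 4 features.
W = [
    [1.0, 0.0, 0.5, -0.5],
    [0.5, -0.5, 1.0, 0.0],
    [-1.0, 1.0, -1.0, 1.0],
]
x = [0.8, 0.3, 0.5, 0.2]
EOS = 2

delta = mute_perturb(W, x, EOS)
p_before = softmax(logits_of(W, x))[EOS]
p_after = softmax(logits_of(W, [xi + d for xi, d in zip(x, delta)]))[EOS]
print(f"p(EOS) before: {p_before:.4f}, after: {p_after:.4f}")
```

In this toy setting the perturbation stays within the `eps` budget while the EOS probability at the first decoding step rises, which is the mechanism by which a caption would be cut short. A Rambling-style objective would keep the same bounded update but ascend a different loss, for example the auto-regressive loss on the correct caption.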

Paper Structure

This paper contains 33 sections, 5 equations, 16 figures, 17 tables, 4 algorithms.

Figures (16)

  • Figure 1: We propose two strategies to protect video content: Ramblings and Mutes. Ramblings mislead video-based large language models into generating inaccurate captions. Mutes, in contrast, prompt these models to produce shorter or even NULL captions that lack descriptive detail. The answers in the figure are excerpts.
  • Figure 2: The pipeline of Ramblings and Mutes. While Rambling-F focuses on feature-level perturbations of the original content, Rambling-L increases auto-regressive loss to manipulate video-based LLMs into generating incorrect descriptions. Mutes, on the other hand, tend to increase the probability of the EOS token. Finally, these video-text pairs can influence downstream tasks, such as the capacity of text-to-video models fine-tuned on these data.
  • Figure 3: Comparison of CLIP score (left) and BLEU (right) for Rambling-F under different parameter settings. When the parameters are too small, the CLIP score and BLEU are high, indicating poor protective performance. The evaluation scores are lower under $\alpha=1, \beta=1$ than under $\alpha=1, \beta=0$ or $\alpha=0, \beta=1$, suggesting that combining the feature losses is both effective and necessary. All videos are annotated by Video-LLaMA on the OpenVid-1M dataset. $\alpha=0, \beta=0$ represents the original video.
  • Figure 4: CLIP score of Ramblings at different perturbation magnitudes. When the perturbation magnitude is small, the CLIP score is higher, indicating poor protective performance for the videos. As the perturbation magnitude increases, the protective performance of Ramblings improves.
  • Figure 5: Length of the textual output for Mutes at different perturbation magnitudes. When the perturbation magnitude is small, the output text is longer, indicating more information leakage from the videos. As the perturbation magnitude increases, Mutes leak less information.
  • ...and 11 more figures