Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics

Georgii Gotin, Ekaterina Shumitskaya, Anastasia Antsiferova, Dmitriy Vatolin

TL;DR

This work addresses the vulnerability of no-reference video quality assessment (VQA) metrics to adversarial manipulation by proposing IC2VQA, a cross-modal transferable attack that generates perturbations on white-box image quality assessment (IQA) metrics augmented with CLIP and transfers them to black-box VQA models. The method introduces a cross-layer loss over IQA metric layers within a multi-modal cross-layer framework, complemented by a CLIP-based term and a temporal consistency constraint, to maximize transferability while keeping perturbations imperceptible. Experiments on a subset of the Xiph.org dataset across three VQA models show that IC2VQA consistently lowers PLCC and SRCC correlations more effectively than the baselines (Square Attack, AttackVQA, and transferable PGD), and ablations show that both the CLIP features and the temporal regularization contribute to the result. The findings highlight vulnerabilities in current VQA metrics and offer a path toward more robust evaluation criteria, pointing to cross-modal relationships between IQA and VQA features as a mechanism for transferability.

Abstract

Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across different modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The addition of the CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known for its effective capture of low-level semantics. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, highlighting its superiority in terms of attack success within the same number of iterations and levels of attack strength. We believe that the proposed method will contribute to the deeper analysis of robust VQA metrics.

Paper Structure (21 sections, 5 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Scheme of the proposed IC2VQA method. Each clip of the original video is passed through the image quality metric, saving the output of the k-th layer, and through the CLIP image model, saving its full output. The attacked video is then passed through the same models and the same outputs are saved. Finally, the cosine similarities between the corresponding saved outputs are aggregated in the cross-layer loss.
  • Figure 2: Overview of the temporal loss computation. For each pair of corresponding frames from the original and attacked videos, the difference $\Delta$ is computed. The temporal loss is the square root of the sum of all differences.
  • Figure 3: IC2VQA attack under different configurations. The plot shows the median SRCC score across different values of epsilon as the number of iterations varies.
  • Figure 4: Example of an IC2VQA attack. The cross-layer loss is computed for layer1 of SPAQ, $\epsilon$ is set to 50/255, and the number of iterations is set to 20. The visual quality of the clean video is clearly higher than that of the attacked video; however, the VSFA metric rates the attacked video as having higher quality.
  • Figure 5: Heatmap of cosine similarity between the features of VSFA layers and those from the NIMA and PaQ-2-PiQ layers. The values represent the cosine similarity scaled by a factor of 100.
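
The captions of Figures 1 and 2 describe two loss terms only in words, so a small sketch may help make them concrete. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, features and frames are represented as flat lists of floats, the cosine similarities are aggregated by a plain sum, and each per-frame difference $\Delta$ is taken to be the sum of squared pixel differences between an original frame and its attacked counterpart. The paper may weight or aggregate these terms differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate (all-zero) features
    return dot / (norm_a * norm_b)


def cross_layer_loss(orig_feats, adv_feats):
    """Aggregate cosine similarities between saved original and attacked outputs.

    orig_feats / adv_feats: lists of flat feature vectors, e.g. the k-th layer
    output of the IQA metric and the CLIP image output, one pair per entry.
    Assumption: the aggregation is an unweighted sum.
    """
    return sum(cosine_similarity(o, a) for o, a in zip(orig_feats, adv_feats))


def temporal_loss(orig_frames, adv_frames):
    """Square root of the summed per-frame differences.

    Assumption: the per-frame difference is the sum of squared pixel
    differences, so the result is an L2 norm over the whole clip.
    """
    total = 0.0
    for orig, adv in zip(orig_frames, adv_frames):
        total += sum((x - y) ** 2 for x, y in zip(orig, adv))
    return math.sqrt(total)
```

In an attack loop, an optimizer would presumably drive the cross-layer term down (pushing attacked features away from the originals) while the temporal term penalizes visible changes; how the two are combined is not specified in the captions and is left open here.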