When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack

Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, Xingxing Wei

TL;DR

This work probes Vision-Language Models for robustness to real-world illumination changes by introducing Illumination Transformation Attack (ITA). ITA models global illumination with multiple parameterized point light sources and uses physics-based reconstruction (IC-Light) combined with a gradient-free CMA-ES optimization to produce natural, illumination-aware adversarial examples that degrade VLM performance across tasks and models. It integrates a CLIP-based adversarial objective with perceptual and diversity constraints (LPIPS and a distance penalty) to preserve realism while maximizing misalignment with ground-truth labels. Experiments on COCO across zero-shot classification, image captioning, and VQA reveal significant vulnerability of a range of VLMs, including CLIP variants and LVLMs, to illumination shifts, underscoring the need for illumination-aware robustness in practical deployments. The framework provides a principled, scalable way to stress-test VLMs under realistic lighting scenarios and offers guidance for improving resilience against environmental illumination variations.
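The way the three terms of the objective fit together can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the hinge form of the distance penalty, the `min_dist` threshold, and the weights `alpha` and `beta` are all assumptions chosen for this example. The embeddings and the perceptual distance are taken as precomputed inputs, where a real pipeline would obtain them from CLIP and LPIPS.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def diversity_penalty(light_positions, min_dist=0.2):
    """Hypothetical hinge penalty: positive only when two lights sit closer than min_dist."""
    total = 0.0
    for i in range(len(light_positions)):
        for j in range(i + 1, len(light_positions)):
            d = math.dist(light_positions[i], light_positions[j])
            total += max(0.0, min_dist - d)
    return total


def attack_score(image_emb, label_emb, perceptual_dist, light_positions,
                 alpha=1.0, beta=1.0):
    """Score to MINIMIZE (assumed sign convention).

    Low similarity to the ground-truth label embedding means a stronger attack;
    the perceptual and diversity terms penalize unnatural or collapsed lighting.
    """
    adversarial = cosine(image_emb, label_emb)
    return adversarial + alpha * perceptual_dist + beta * diversity_penalty(light_positions)
```

Because the score is a plain scalar function of the lighting parameters, a gradient-free optimizer such as CMA-ES can minimize it without differentiating through the relighting model.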

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose Illumination Transformation Attack (ITA), the first framework to systematically assess VLMs' robustness against illumination changes. Two key challenges arise: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions, and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations than previous methods could capture. By integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we can precisely render such light interactions in the original scenes, achieving fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstruction model's latent space rather than through direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints that maintain visual consistency with the original images and diversity constraints that keep the light sources from converging. Extensive experiments demonstrate that ITA significantly reduces the performance of advanced VLMs, e.g., LLaVA-1.6, while retaining competitive naturalness, exposing VLMs' critical illumination vulnerabilities.
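To make the point-light decomposition concrete, the sketch below builds a coarse illumination map as the sum of contributions from several point lights. It is an illustrative assumption, not the parameterization used in the paper: the `(x, y, intensity)` tuple, the inverse-square falloff, and the softening constant `eps` are hypothetical choices, and the paper renders its lighting through a reconstruction model rather than a hand-written map like this one.

```python
import math


def light_map(lights, width, height, eps=1e-2):
    """Sum the contributions of point lights on a height x width grid.

    lights: list of (x, y, intensity), with x and y in normalized [0, 1] image
            coordinates (hypothetical parameterization).
    eps:    softening term that keeps the value finite at the light position.
    """
    grid = []
    for row in range(height):
        values = []
        for col in range(width):
            # Use the pixel center in normalized coordinates.
            px = (col + 0.5) / width
            py = (row + 0.5) / height
            total = 0.0
            for lx, ly, intensity in lights:
                dist_sq = (px - lx) ** 2 + (py - ly) ** 2
                total += intensity / (dist_sq + eps)  # softened inverse-square falloff
            values.append(total)
        grid.append(values)
    return grid
```

Each light contributes independently, so the map for several lights is the sum of the single-light maps. That additivity is what lets a small set of parameters describe a wide range of scene-level lighting conditions.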

Paper Structure

This paper contains 14 sections, 11 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Revealing VLM Vulnerabilities to Illumination Variations. Comparison of the predictions made by OpenCLIP ViT-B/16 on natural images with normal and OOD illumination versus illumination-aware adversarial examples generated by our Illumination Transformation Attack (ITA). The results illustrate that VLMs are vulnerable to illumination changes, underscoring the need for systematic evaluation.
  • Figure 2: Overview of the Proposed ITA Framework. Three previously proposed illumination attack methods rely on localized lighting disturbances (top). In contrast, our method applies illumination transformations across the entire scene, generating illumination-aware adversarial examples, ensuring both naturalness and adversariality (bottom).
  • Figure 3: Visualization Results. (Left) From top to bottom, the rows show the results of Clean, Natural Light Attack, Shadow Attack, and our method, along with their corresponding Top-1 labels, evaluated on OpenCLIP ViT-B/16. Red indicates misclassified labels. (Right) Visualization of some illumination changes causing LVLMs to give incorrect answers.
  • Figure 4: Ablation Study Results on Optimization Hyperparameters. Attack success rate of illumination-aware adversarial examples with (A) different numbers of light sources at different iteration steps on OpenCLIP and (B) different population sizes across different CLIP versions at 200 iteration steps.
  • Figure 5: Ablation Study Results on Weight Factors. (a) Visual examples of illumination changes as $\alpha$ varies, along with the corresponding predicted labels in OpenCLIP. (b) Visualization of the final optimized illumination as $\beta$ varies across the same sample with three light sources.