FlipLLM: Efficient Bit-Flip Attacks on Multimodal LLMs using Reinforcement Learning

Khurram Khalil, Khaza Anuarul Hoque

TL;DR

FlipLLM reframes bit-flip attack (BFA) discovery on large language and vision-language models as a sequential decision problem. It combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets, finding critical bits up to 2.5× faster than prior methods and degrading accuracy across diverse architectures. The framework reveals consistent vulnerability in attention projections and normalization layers and shows that hardware protections such as ECC SECDED can mitigate these attacks, underscoring the need for hardware-aware defenses. Overall, FlipLLM provides a scalable, architecture-agnostic toolkit for rigorous hardware-security evaluation of modern foundation models.

Abstract

Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Vision-Language Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggle to scale, often failing to analyze the vast parameter space and complex interdependencies of modern foundation models in a reasonable time. This paper proposes FlipLLM, an architecture-agnostic reinforcement learning (RL) framework that formulates BFA discovery as a sequential decision-making problem. FlipLLM combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets that can induce catastrophic failure. We demonstrate the effectiveness and generalizability of FlipLLM by applying it to a diverse set of models, including prominent text-only LLMs (GPT-2 Large, LLaMA 3.1 8B, and DeepSeek-V2 7B) and VLMs such as LLaVA 1.6, and to datasets such as MMLU, MMLU-Pro, VQAv2, and TextVQA. Our results show that FlipLLM identifies critical bits that are vulnerable to BFAs up to 2.5× faster than state-of-the-art (SOTA) methods. Flipping the FlipLLM-identified bits reduces the accuracy of LLaMA 3.1 8B from 69.9% to ~0.2% and LLaVA's VQA score from 78% to almost 0%, with as few as 5 and 7 bit flips, respectively. Further analysis reveals that applying standard hardware protection mechanisms, such as ECC SECDED, to the FlipLLM-identified bit locations completely mitigates the BFA impact, demonstrating the practical value of our framework in guiding hardware-level defenses. FlipLLM offers the first scalable and adaptive methodology for exploring the BFA vulnerability of both language and multimodal foundation models, paving the way for comprehensive hardware-security evaluation.
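To give a sense of why a handful of bit flips can be so damaging, the following minimal sketch flips single bits in one weight stored as an IEEE 754 half-precision value. It is an illustration only, not the FlipLLM procedure: the fp16 storage format, the example weight `0.5`, and the helper name `flip_bit_fp16` are assumptions made here, since the abstract does not state which numeric formats were evaluated.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of `value` as stored in IEEE 754 half precision.

    Hypothetical helper for illustration. Bit 0 is the least significant
    mantissa bit, bits 10-14 are the exponent, and bit 15 is the sign.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in the range [0, 15]")
    # Reinterpret the 16-bit float as an unsigned integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # XOR with a one-hot mask toggles exactly the chosen bit.
    flipped = raw ^ (1 << bit)
    # Reinterpret the modified bit pattern as a float again.
    return struct.unpack("<e", struct.pack("<H", flipped))[0]


if __name__ == "__main__":
    weight = 0.5  # assumed example weight
    for bit in (0, 9, 14, 15):
        print(f"bit {bit:2d}: {weight} -> {flip_bit_fp16(weight, bit)}")
```

In this toy case, flipping a low mantissa bit barely changes the weight, while flipping the most significant exponent bit turns 0.5 into 32768.0. A discovery method therefore has to search over both which parameter and which bit position to target, which is part of what makes the search space so large.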

Paper Structure

This paper contains 37 sections, 5 equations, 6 figures, 6 tables, 3 algorithms.

Figures (6)

  • Figure 1: Overview of the FlipLLM framework.
  • Figure 2: MMLU accuracy vs. number of bit flips for the LLaMA 3.1 8B, DeepSeek V2, and GPT-2 Large models.
  • Figure 3: Computational scalability of FlipLLM's Q-Learning phase on LLaMA 3.1 8B, showing linear runtime growth ($R^2 > 0.99$) and predictable, super-linear memory usage ($O(k^{1.3})$) with respect to the initial candidate parameter set size, $k$.
  • Figure 4: Layer sensitivity profiling for LLaMA 3.1 8B, showing accuracy after perturbing the top 0.1% sensitive parameters in each layer. Attention and normalization layers show the highest vulnerability.
  • Figure 5: FlipLLM Q-learning dynamics on LLaMA 3.1 8B over 50 generations. The agent rapidly reduces the critical bit set size ($|s_t|$, purple) while improving the impact-per-flip (approximated by negative reward, red).
  • ...and 1 more figure