LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models

Xiaoning Feng, Xiaohong Han, Simin Chen, Wei Yang

TL;DR

This work uncovers a notable vulnerability in large language models: small, imperceptible input perturbations can dramatically inflate computation costs, latency, and energy use. It introduces LLMEffiChecker, a dual white-box/black-box framework that perturbs seed inputs at character, token, and structural levels to maximize decoder invocations while staying inconspicuous. Through extensive experiments on nine public LLMs across translation, sentence completion, and code generation tasks, the approach yields substantial degradation in efficiency (up to thousands of percent in latency and energy) with perturbations of 1–3 tokens, and demonstrates the need for runtime detectors to mitigate such attacks. The study provides concrete methodology, metrics, and mitigation strategies, highlighting practical implications for latency-sensitive deployments and energy-constrained devices.

Abstract

In this paper, we make the first attempt to understand and test the computation efficiency robustness of state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property of LLMs that can be manipulated adversarially to reduce computation efficiency significantly. Our key motivation is to generate test inputs that sufficiently delay the generation of EOS, so that LLMs have to keep iterating until they reach the pre-configured threshold. We present LLMEffiChecker, which works in both white-box and black-box settings. In the white-box scenario, LLMEffiChecker uses a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels. In the black-box scenario, LLMEffiChecker employs a causal inference-based approach to find critical tokens and applies the same three levels of imperceptible perturbation to them. Both settings effectively delay the appearance of EOS, compelling these inputs to reach a threshold that is naturally unreachable. To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT, and Salesforce CodeGen. Experimental results show that LLMEffiChecker can increase LLMs' response latency and energy consumption by 325% to 3244% and 344% to 3616% on average, respectively, by perturbing just one character or token in the input sentence.
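
The property the abstract relies on can be illustrated with a toy decoding loop. This is a minimal sketch, not the authors' implementation: `greedy_decode`, `toy_model`, and `relative_increase` are hypothetical names, the "model" is a stand-in that emits EOS at a fixed step, and the numbers are illustrative rather than results from the paper. It only shows that the number of decoder invocations is bounded by EOS or by the configured length threshold, whichever comes first.

```python
EOS = "<eos>"


def greedy_decode(next_token, prompt, max_length):
    """Toy autoregressive loop; returns (output tokens, decoder invocations)."""
    output = []
    calls = 0
    while len(output) < max_length:
        token = next_token(prompt, output)  # one decoder invocation
        calls += 1
        if token == EOS:
            break  # normal termination: the model decided to stop
        output.append(token)
    return output, calls


def toy_model(eos_step):
    """Hypothetical stand-in for a decoder that emits EOS at step `eos_step`."""
    def next_token(prompt, output):
        return EOS if len(output) == eos_step else "tok"
    return next_token


def relative_increase(baseline, perturbed):
    """Percentage increase of `perturbed` over `baseline`."""
    return 100.0 * (perturbed - baseline) / baseline


# Benign input: EOS appears early, so decoding stops well before the threshold.
_, benign_calls = greedy_decode(toy_model(eos_step=12), ["seed"], max_length=200)

# Input whose EOS is delayed indefinitely: decoding runs to the threshold.
_, delayed_calls = greedy_decode(toy_model(eos_step=10**9), ["seed"], max_length=200)

print(benign_calls, delayed_calls, relative_increase(benign_calls, delayed_calls))
```

In this sketch the cost of a request is proportional to the number of loop iterations, which is why delaying EOS translates directly into higher latency and energy use, and why the configured threshold caps the damage.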

Paper Structure

This paper contains 31 sections, 6 equations, 13 figures, and 14 tables.

Figures (13)

  • Figure 1: Working mechanism of LLMs
  • Figure 2: Examples illustrating LLMs' efficiency degradation by inserting one character (using HuggingFace API)
  • Figure 3: The distribution of max_length values
  • Figure 4: Design overview of LLMEffiChecker
  • Figure 5: Constituency tree of sentence
  • ...and 8 more figures