
ThinkTrap: Denial-of-Service Attacks against Black-box LLM Services via Infinite Thinking

Yunzhe Li, Jianan Wang, Hongzi Zhu, James Lin, Shan Chang, Minyi Guo

TL;DR

ThinkTrap reveals a denial-of-service vulnerability in black-box LLM services: carefully crafted prompts can trigger unbounded reasoning. It introduces a two-stage framework that first optimizes prompts offline in a low-dimensional surrogate space using CMA-ES, then injects them online at modest request rates to degrade service. Across commercial and self-hosted LLMs, ThinkTrap elicits long outputs, substantial latency increases, and GPU memory exhaustion at modest cost, demonstrating a practical threat to LLM infrastructure. The work also analyzes defenses, finding that resource-aware scheduling is effective but carries QoS trade-offs, and advocates prompt-level defenses for robust, scalable LLM hosting.
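To make the resource-aware scheduling defense and its QoS trade-off concrete, the following is a minimal sketch of per-client output-token budgeting. It is not the paper's defense implementation: the class `TokenBudgetScheduler`, its methods, and the budget value of 8192 tokens per window are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBudgetScheduler:
    """Hypothetical per-client output-token budget (illustrative only)."""

    budget_per_window: int  # assumed cap on generated tokens per client per window
    used: Dict[str, int] = field(default_factory=dict)

    def admit(self, client_id: str, max_new_tokens: int) -> int:
        """Return how many output tokens this request may generate (0 = reject)."""
        remaining = self.budget_per_window - self.used.get(client_id, 0)
        return max(0, min(max_new_tokens, remaining))

    def record(self, client_id: str, tokens_generated: int) -> None:
        """Charge the tokens actually generated against the client's budget."""
        self.used[client_id] = self.used.get(client_id, 0) + tokens_generated

    def reset_window(self) -> None:
        """Start a new accounting window for every client."""
        self.used.clear()


scheduler = TokenBudgetScheduler(budget_per_window=8192)

# A client whose requests always run to a 4096-token cap exhausts its budget
# after two requests; the third request is granted nothing.
grants = []
for _ in range(3):
    granted = scheduler.admit("heavy-client", max_new_tokens=4096)
    scheduler.record("heavy-client", granted)
    grants.append(granted)

# A client with short completions is unaffected by the heavy client's usage.
light_grant = scheduler.admit("light-client", max_new_tokens=256)
```

The trade-off shows up directly in this sketch: the budget cannot tell an adversarial long generation from a legitimate one, so a benign client with genuinely long reasoning tasks is throttled in the same way.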

Abstract

Large Language Models (LLMs) have become foundational components in a wide range of applications, including natural language understanding and generation, embodied intelligence, and scientific discovery. As their computational requirements continue to grow, these models are increasingly deployed as cloud-based services, allowing users to access powerful LLMs via the Internet. However, this deployment model introduces a new class of threat: denial-of-service (DoS) attacks via unbounded reasoning, where adversaries craft specially designed inputs that cause the model to enter excessively long or infinite generation loops. These attacks can exhaust backend compute resources, degrading or denying service to legitimate users. To mitigate such risks, many LLM providers adopt a closed-source, black-box setting to obscure model internals. In this paper, we propose ThinkTrap, a novel input-space optimization framework for DoS attacks against LLM services even in black-box environments. The core idea of ThinkTrap is to first map discrete tokens into a continuous embedding space, then undertake efficient black-box optimization in a low-dimensional subspace exploiting input sparsity. The goal of this optimization is to identify adversarial prompts that induce extended or non-terminating generation across several state-of-the-art LLMs, achieving DoS with minimal token overhead. We evaluate the proposed attack across multiple commercial, closed-source LLM services. Our results demonstrate that, even far under the restrictive request frequency limits commonly enforced by these platforms, typically capped at ten requests per minute (10 RPM), the attack can degrade service throughput to as low as 1% of its original capacity, and in some cases, induce complete service failure.
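A back-of-the-envelope calculation helps explain why a rate as low as 10 RPM can matter. By Little's law, the average number of requests in flight equals the arrival rate times the time each request occupies the server. The sketch below applies that law; the decode speed of 20 tokens per second and the 256-token benign output length are assumptions made purely for illustration, while 4096 tokens is the output upper bound mentioned in the paper's figure captions.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time one request spends decoding, ignoring prefill and queueing."""
    return output_tokens / tokens_per_second


def steady_state_concurrency(requests_per_minute: float, service_seconds: float) -> float:
    """Little's law: average requests in flight = arrival rate * service time."""
    arrival_rate_per_second = requests_per_minute / 60.0
    return arrival_rate_per_second * service_seconds


ASSUMED_DECODE_SPEED = 20.0   # tokens per second per request (hypothetical)
ASSUMED_BENIGN_TOKENS = 256   # typical short completion (hypothetical)
OUTPUT_CAP_TOKENS = 4096      # upper bound referenced in the figure captions
RATE_LIMIT_RPM = 10           # request cap cited in the abstract

benign_load = steady_state_concurrency(
    RATE_LIMIT_RPM, generation_seconds(ASSUMED_BENIGN_TOKENS, ASSUMED_DECODE_SPEED)
)
adversarial_load = steady_state_concurrency(
    RATE_LIMIT_RPM, generation_seconds(OUTPUT_CAP_TOKENS, ASSUMED_DECODE_SPEED)
)

print(f"benign requests in flight:      {benign_load:.2f}")
print(f"adversarial requests in flight: {adversarial_load:.2f}")
print(f"load amplification:             {adversarial_load / benign_load:.0f}x")
```

Under these assumed numbers, the same request rate keeps roughly 16 times as many generations active at once, and each active generation also holds its KV cache in GPU memory for the whole duration. The actual figures depend on the model, hardware, and batching strategy, none of which this sketch models.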

Paper Structure

This paper contains 45 sections, 8 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Attack overview of the proposed ThinkTrap system, where the attack prompts are first generated offline and then injected into the LLM in a stealthy way to conduct a denial-of-service attack.
  • Figure 2: Output length of DeepSeek R1 under ThinkTrap, relative to the 4096-token upper bound, for (a) different prompt lengths and (b) different latent vector dimensions. Both hyperparameters show a non-monotonic trend, reflecting a balance between prompt expressiveness and search efficiency.
  • Figure 3: Output length of the eight evaluated LLMs under ThinkTrap and the four baselines, relative to the 4096-token upper bound. The baselines perform inconsistently across models, whereas ThinkTrap achieves the highest output length on every LLM. Its advantage is particularly evident under lower generation budgets, demonstrating its efficiency in maximizing output with minimal resources.
  • Figure 4: Output length relative to the maximum limit of 4096 tokens for the eight evaluated LLMs under ThinkTrap, across decoding temperatures of 0, 0.7, 1, and 1.7. Higher temperatures, which introduce greater sampling randomness, consistently result in longer outputs.
  • Figure 5: Impact of the ThinkTrap attack on the DeepSeek Llama service, served with the Transformers library on 4 NVIDIA 2080 Ti GPUs, at an attack rate of 10 RPM (the highest rate the request limit permits) under different output token limits. Only the unrealistically low limit of 128 tokens successfully defends against the attack.
  • ...and 2 more figures