Merlin's Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li

TL;DR

This work tackles the problem of overthinking in large reasoning models by treating models as black-box communicators and introducing AdvPrompt, an iterative adversarial prompting framework. AdvPrompt synthesizes prompts from diverse perspectives to elicit concise reasoning while preserving accuracy, enabling practical deployment in both open-source LRMs and closed-source APIs. On four reasoning benchmarks (GSM8K, MATH-500, AMC 2023, AIME 2024), AdvPrompt achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 series and an average token reduction of 19% to 41% across the benchmarks; on Claude-3.7 and Gemini-2.5 on MATH-500 it reduces tokens by 35% and 47% respectively with minimal accuracy loss. The results demonstrate generalizability across model scales and families, supporting black-box prompting as a practical strategy for efficient reasoning in LRMs.

Abstract

Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex reasoning tasks through step-by-step thinking. However, such a lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of these models. In this work, we present a new perspective on mitigating overthinking in LRMs via black-box adversarial prompting. By treating both open-source LRMs and closed-source APIs as black-box communicators, we investigate how to elicit concise responses without sacrificing accuracy. We introduce AdvPrompt, an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that AdvPrompt consistently reduces token usage while preserving performance. Notably, AdvPrompt achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series, and delivers an average ~40% token reduction across four benchmarks. For closed-source APIs, AdvPrompt reduces token usage on MATH-500 by 35% for Claude-3.7 and 47% for Gemini-2.5. Further analysis reveals the generalizability of AdvPrompt across various model scales and families, underscoring the potential of black-box prompting as a practical and effective strategy for enhancing LRM efficiency.
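
The abstract describes AdvPrompt only at a high level: candidate prompts are appended to questions sent to a black-box model, and the candidates are refined over several iterations to shorten responses while preserving accuracy. The following is a minimal sketch of such a loop, not the authors' implementation. The names `evaluate`, `refine_prompts`, `propose`, `count_tokens`, and `max_acc_drop`, the suffix-match correctness check, and the keep-the-shortest selection rule are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[str], str]  # black-box: text in, text out


def evaluate(prompt: str, questions: Sequence[str], answers: Sequence[str],
             model: Model, count_tokens: Callable[[str], int]) -> Tuple[float, float]:
    """Return (accuracy, mean response length) with `prompt` appended to each question."""
    correct, tokens = 0, 0
    for question, answer in zip(questions, answers):
        response = model(question + "\n" + prompt)
        # Assumption: a response counts as correct if it ends with the gold answer.
        correct += int(response.strip().endswith(answer))
        tokens += count_tokens(response)
    n = len(questions)
    return correct / n, tokens / n


def refine_prompts(seed_prompts: Sequence[str],
                   propose: Callable[[List[str]], List[str]],
                   questions: Sequence[str], answers: Sequence[str],
                   model: Model, count_tokens: Callable[[str], int],
                   rounds: int = 3, keep: int = 2,
                   max_acc_drop: float = 0.01) -> Tuple[List[Tuple[float, str]], float]:
    """Iteratively keep the shortest-response prompts that do not hurt accuracy.

    `propose` is a hypothetical generator that rewrites surviving prompts into
    new candidates (for example, by asking another LLM for variations).
    Returns the best (mean_tokens, prompt) pairs and the baseline mean length.
    """
    base_acc, base_tokens = evaluate("", questions, answers, model, count_tokens)
    pool = list(dict.fromkeys(seed_prompts))  # de-duplicate, keep order
    best: List[Tuple[float, str]] = []
    for _ in range(rounds):
        scored = []
        for prompt in pool:
            acc, toks = evaluate(prompt, questions, answers, model, count_tokens)
            if acc >= base_acc - max_acc_drop:  # discard accuracy-damaging prompts
                scored.append((toks, prompt))
        scored.sort()  # shortest mean response first
        best = scored[:keep]
        survivors = [p for _, p in best]
        # Next round: survivors plus freshly proposed variations of them.
        pool = list(dict.fromkeys(survivors + propose(survivors)))
    return best, base_tokens
```

In practice `model` would wrap an API call or a local LRM, and `count_tokens` would use the model's own tokenizer. Dividing each returned mean length by the baseline length gives a compression ratio for that candidate.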

Paper Structure

This paper contains 37 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Unlike the verbose reasoning typical of LRMs (upper), AdvPrompt reveals that appending an adversarial prompt (e.g., false evidence) elicits concise reasoning without compromising performance (lower).
  • Figure 2: Illustration of AdvPrompt. Unlike prior approaches to efficient reasoning, AdvPrompt casts the task as a black-box adversarial prompting problem. Given a black-box model or API, the framework (left) generates high-quality adversarial prompts from diverse perspectives (right) to elicit concise reasoning and iteratively refines the prompts to improve the efficiency–performance trade-off.
  • Figure 3: Experimental results of AdvPrompt using closed-source APIs on MATH-500. It effectively reduces token usage by $35\%$--$47\%$ while maintaining comparable reasoning performance.
  • Figure 4: Compression ratios (%) of top-$5$ prompt candidates across perspectives. The Evidence perspective is most effective on Qwen3-8B, while RolePlay candidates perform best on DeepSeek-R1-Distill-Qwen-14B.
  • Figure 5: Generalizability of top-performing prompt candidates across Qwen3 model scales. Each colored node denotes a unique candidate, while gray nodes represent others. Notably, Evidence-I appears among the top-5 candidates across all model scales.
  • ...and 4 more figures