Incorporating Token Usage into Prompting Strategy Evaluation

Chris Sypherd, Sergei Petrov, Sonny George, Vaishak Belle

TL;DR

This work tackles the practical problem of prompting inefficiency by introducing Big-$O_{tok}$, a theoretical framework for describing how token usage scales with prompting strategy variables, and Token Cost (TC), an empirical metric for tokens per accuracy. The authors combine theory and experiments across three prompting strategies, three benchmarks, and three open LLMs to show that additional tokens yield diminishing accuracy returns, illustrating the need for efficiency-aware evaluation. The observed accuracy-versus-token trends follow a $y = \log(\log(x))$ pattern, the diminishing returns are quantified through average and marginal TC, and the results align with the proposed Big-$O_{tok}$ categories. The work demonstrates how token usage can meaningfully affect real-world prompting utility and provides reproducible metrics to guide future, efficiency-conscious prompting research and practice.
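
As a rough illustration of the average and marginal TC mentioned above, the sketch below treats average TC as tokens divided by accuracy, and marginal TC as the extra tokens spent per extra point of accuracy between two adjacent configurations. These formulas are an assumption based on the "tokens per accuracy" description, not the paper's exact definitions. The `Run` container and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One evaluated configuration (hypothetical container, not from the paper)."""
    tokens: float    # mean tokens used per question
    accuracy: float  # accuracy in percentage points (0-100)


def average_tc(run: Run) -> float:
    """Assumed average Token Cost: tokens spent per point of accuracy."""
    if run.accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return run.tokens / run.accuracy


def marginal_tc(prev: Run, curr: Run) -> float:
    """Assumed marginal Token Cost: extra tokens per extra point of accuracy."""
    gain = curr.accuracy - prev.accuracy
    if gain <= 0:
        return float("inf")  # the extra tokens bought no additional accuracy
    return (curr.tokens - prev.tokens) / gain


# Illustrative numbers only; they are not results from the paper.
runs = [Run(100, 50.0), Run(400, 60.0), Run(1600, 64.0)]
avg = [average_tc(r) for r in runs]
marg = [marginal_tc(a, b) for a, b in zip(runs, runs[1:])]
```

With these made-up numbers, marginal TC rises from 30 to 300 tokens per accuracy point, which is the diminishing-returns shape the summary describes.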

Abstract

In recent years, large language models have demonstrated remarkable performance across diverse tasks. However, their task effectiveness is heavily dependent on the prompting strategy used to elicit output, which can vary widely in both performance and token usage. While task performance is often used to determine prompting strategy success, we argue that efficiency--balancing performance and token usage--can be a more practical metric for real-world utility. To enable this, we propose Big-$O_{tok}$, a theoretical framework for describing the token usage growth of prompting strategies, and analyze Token Cost, an empirical measure of tokens per performance. We apply these to several common prompting strategies and find that increased token usage leads to drastically diminishing performance returns. Our results validate the Big-$O_{tok}$ analyses and reinforce the need for efficiency-aware evaluations.

Paper Structure

This paper contains 30 sections, 3 equations, 4 figures, and 12 tables.

Figures (4)

  • Figure 1: Accuracy vs. token usage plots with standard error bars for various prompting strategies, models, and benchmarks. The trend lines demonstrate the rapid growth of TC for these strategies.
  • Figure 2: Sample derivations of Big-$O_{tok}$. The textual descriptions in each figure are drawn from the following sources: (c) Kojima et al. (2022); (d) Brown et al. (2020); (e) Wei et al. (2022); (f) Wang et al. (2023). Note that for (d), the few-shot examples are equivalent to the MVIO, and we assume that the LLM follows that pattern.
  • Figure 3: Accuracy and total token usage for the ablation study on the number of few-shot exemplars on the GSM8K benchmark. Standard error bars are included.
  • Figure 4: Accuracy and total token usage information for Qwen 2.5 14B and Qwen 2.5 32B from the empirical evaluation. The trend lines demonstrate the rapid growth of TC for these prompting strategies.
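
The trend lines in Figures 1 and 4 are described in the summary as following a $y = \log(\log(x))$ pattern. The following is a minimal sketch of fitting such a curve. It assumes that $x$ is token usage, that $y$ is accuracy, and that the fitted form is $y = a \log(\log(x)) + b$. The scale `a`, the offset `b`, and the helper `fit_loglog` are assumptions for illustration, and the data points are synthetic rather than taken from the paper.

```python
import math


def fit_loglog(xs, ys):
    """Least-squares fit of y = a * log(log(x)) + b; every x must exceed 1."""
    # Transform x so the model becomes an ordinary straight-line fit in u.
    us = [math.log(math.log(x)) for x in xs]
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    if var_u == 0:
        raise ValueError("need at least two distinct x values")
    cov = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    a = cov / var_u
    b = mean_y - a * mean_u
    return a, b


# Synthetic points generated from a = 10, b = 40 (not results from the paper).
tokens = [50, 200, 800, 3200]
accuracy = [10 * math.log(math.log(t)) + 40 for t in tokens]
a, b = fit_loglog(tokens, accuracy)
```

Because the synthetic points lie exactly on the assumed curve, the fit recovers `a` and `b` up to floating-point error. Real measurements would scatter around the trend line.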