An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint

Yi Sun, Han Wang, Jiaqiang Li, Jiacheng Liu, Xiangyu Li, Hao Wen, Yizhen Yuan, Huiwen Zheng, Yan Liang, Yuanchun Li, Yunxin Liu

TL;DR

The paper investigates how large language models reason when constrained by strict output token budgets, a setting motivated by real-world latency requirements. It introduces two constraint schemes (direct termination, and early stopping with a concluding message) to fairly evaluate constrained reasoning across 30 open-source LLMs on GSM8K and MATH500, and it maps token budgets to on-device latency. Across datasets and prompt styles, the study finds that early stopping generally yields higher accuracy than direct termination, and that mid-sized models can be more latency-efficient than larger ones under tight budgets. The results provide actionable guidance for deploying LLMs in time-sensitive settings and are supported by open-source code and data for reproducibility.
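The two constraint schemes named above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `toy_model`, the `generate(prompt, max_tokens)` interface, the concluding message tokens, and the `answer_reserve` parameter are all hypothetical, and the choice to count the concluding message against the output budget is an assumption.

```python
from typing import Callable, List

# Hypothetical interface: takes prompt tokens and a cap, returns generated tokens.
Generate = Callable[[List[str], int], List[str]]


def toy_model(prompt: List[str], max_tokens: int) -> List[str]:
    """Stand-in model: reasons for 10 tokens then answers, unless the
    prompt already ends with the concluding cue, in which case it answers."""
    if prompt and prompt[-1] == "answer:":
        full = ["42"]
    else:
        full = ["think"] * 10 + ["42"]
    return full[:max_tokens]


def direct_termination(generate: Generate, prompt: List[str], budget: int) -> List[str]:
    """Scheme 1: simply cut generation off after `budget` output tokens."""
    return generate(prompt, budget)


def early_stopping(
    generate: Generate,
    prompt: List[str],
    budget: int,
    conclusion: List[str],
    answer_reserve: int,
) -> List[str]:
    """Scheme 2: stop reasoning early, append a concluding message, and let
    the model produce its answer within the tokens that remain."""
    # Assumption: the concluding message counts against the output budget.
    reasoning_budget = budget - len(conclusion) - answer_reserve
    if reasoning_budget < 0:
        raise ValueError("budget too small for the concluding message and answer")
    reasoning = generate(prompt, reasoning_budget)
    if len(reasoning) < reasoning_budget:
        # The model finished on its own before reaching the cap.
        return reasoning
    answer = generate(prompt + reasoning + conclusion, answer_reserve)
    return reasoning + conclusion + answer
```

With a budget of 5 tokens, the toy model under direct termination is cut off mid-reasoning and never states an answer, while early stopping spends 2 tokens on reasoning, 2 on the concluding message, and 1 on the answer.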

Abstract

Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making models think before answering, they are able to achieve much higher accuracy with extra inference computation. However, in many real-world scenarios, models are used under time constraints, where an answer should be given within a certain output length. It is unclear whether and how the reasoning ability of different LLMs remains effective under strict constraints. We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test 30 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between inference accuracy and various properties, including model type, model size, and prompt style. We also consider the mappings between token budgets and actual on-device latency budgets. The results reveal several interesting findings about budget-aware LLM reasoning ability that differ from the unconstrained situation, e.g., the optimal choice of model size or prompt style changes under different budgets. These findings offer a timely evaluation of this area and practical guidance for users deploying LLMs under real-world latency constraints.
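The relationship between a latency budget and a token budget can be illustrated with a simple linear model. This is a sketch under stated assumptions, not the paper's measured mapping: it assumes latency is a fixed prefill time plus a constant per-token decode time, and the function name and all numbers in the usage note are hypothetical.

```python
def token_budget(latency_budget_ms: int, prefill_ms: int, ms_per_token: int) -> int:
    """Output tokens that fit in a latency budget, assuming
    latency = prefill_ms + n_tokens * ms_per_token (a simplification)."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    remaining = latency_budget_ms - prefill_ms
    if remaining <= 0:
        # Prefill alone already uses up the whole budget.
        return 0
    # Integer division rounds down: a partially decoded token does not count.
    return remaining // ms_per_token
```

Under this model, a hypothetical larger model with 500 ms prefill and 50 ms per token fits 30 output tokens into a 2000 ms budget, whereas a hypothetical smaller model with 200 ms prefill and 20 ms per token fits 90. The same latency budget therefore translates into different token budgets for different model sizes.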

Paper Structure

This paper contains 30 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Two methods used in our work to ensure strict output length constraints for LLM reasoning.
  • Figure 2: The early stopping method (solid line) outperforms direct termination (dashed line) on the GSM8K (left) and MATH500 (right) datasets. Prompting styles: sbs, c2f, and aav.
  • Figure 3: Under a token budget, small models can outperform larger models on both datasets.
  • Figure 4: Under token budgets, reasoning models are not always better than instruction-tuned or math models.
  • Figure 5: Comparison of Qwen-2.5-Instruct models on an NVIDIA A800 GPU under an inference latency budget.
  • ...and 11 more figures