Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality

Zhimin Hu, Riya Roshan, Sashank Varma

TL;DR

This paper investigates whether resource rationality emerges from inference-time scaling in large language models, using the Variable Attribution Task (VAT) to control task complexity. The study compares instruction-tuned (IT) models with large reasoning models (LRMs) and finds a robust shift from permutation strategies (testing candidate pairs) to elimination strategies (pruning inconsistent pairs) as complexity grows. XOR/XNOR functions often resist pruning, which points to function-dependent resource allocation. This adaptive behavior arises without explicit cost-based rewards, suggesting that resource rationality is an emergent property of extended inference. The findings imply that reasoning traces and token-length investments reflect a reallocation of resources under finite capacity, which informs both the interpretation of LLM behavior and the design of computation-aware prompting strategies.

Abstract

Human reasoning is shaped by resource rationality, that is, optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource rationality can emerge from such scaling without explicit reward related to computational costs. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both model types exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward. This provides compelling evidence that resource rationality is an emergent property of inference-time scaling itself.

Paper Structure (16 sections, 5 figures, 1 table)

This paper contains 16 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Variable Attribution Task and its computational strategies. Given input–output records from $T$ trials and $N$ candidate variables, the model must identify the pair that determines the output via a latent Boolean function. Solving the task starts either from testing candidate pairs (permutation) or incrementally pruning inconsistent pairs (elimination).
  • Figure 2: Strategy distribution across task complexity. (Top) Heatmaps showing the proportion of elimination strategies across $N$ and $T$, where the top and right marginal plots represent the mean proportion trends. (Bottom) Proportion of elimination for specific logical functions. (Left) DeepSeek-R1. (Right) Qwen3-thinking.
  • Figure 3: The fitted decision landscape of strategy selection. The heatmap represents the predicted probability of choosing the elimination strategy based on the interaction model. The gray dots represent actual experimental samples, and the white line denotes the mean experimental path ($T_{mean}$ for each $N$). The dashed line indicates the 50% decision boundary.
  • Figure 4: Mean accuracy across logical functions as task complexity ($N$) increases (DeepSeek). From top to bottom: DeepSeek R1, V3, and V3 with direct answer.
  • Figure 5: Computational scaling behavior (DeepSeek R1). (Left) Average character count relative to $N$. (Right) Scaling relative to trials $T$. The dashed lines (XOR/XNOR) illustrate a higher intercept and steeper slope, indicating that logically complex tasks demand more inference resources.
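
The two strategies named in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function set beyond XOR/XNOR, the exhaustive enumeration of trials, and all names (`FUNCTIONS`, `make_trials`, `solve_by_permutation`, `solve_by_elimination`) are assumptions made for illustration. In the paper the number of trials $T$ is varied rather than enumerated exhaustively.

```python
from itertools import combinations, product

# Assumed function set: XOR/XNOR appear in the source; AND/OR are illustrative.
FUNCTIONS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "XNOR": lambda a, b: 1 - (a ^ b),
}


def make_trials(n_vars, pair, name):
    """Build input-output records by enumerating every binary input (an assumption)."""
    i, j = pair
    return [
        (bits, FUNCTIONS[name](bits[i], bits[j]))
        for bits in product((0, 1), repeat=n_vars)
    ]


def hypotheses(n_vars):
    """All (variable pair, function name) candidates for n_vars variables."""
    return [
        (pair, name)
        for pair in combinations(range(n_vars), 2)
        for name in FUNCTIONS
    ]


def consistent(hyp, trial):
    """Check whether one hypothesis reproduces the output of one trial."""
    (i, j), name = hyp
    inputs, output = trial
    return FUNCTIONS[name](inputs[i], inputs[j]) == output


def solve_by_permutation(n_vars, trials):
    """Permutation: take each candidate in turn and test it against every trial."""
    return [
        h for h in hypotheses(n_vars)
        if all(consistent(h, t) for t in trials)
    ]


def solve_by_elimination(n_vars, trials):
    """Elimination: walk through trials, pruning candidates that contradict each one."""
    alive = hypotheses(n_vars)
    for t in trials:
        alive = [h for h in alive if consistent(h, t)]
        if len(alive) <= 1:  # stop early once at most one candidate survives
            break
    return alive


trials = make_trials(4, (0, 2), "XOR")
print(solve_by_permutation(4, trials))
print(solve_by_elimination(4, trials))
```

Both solvers recover the same hypothesis when the trials come from a single true pair and function. They differ in how the work is organized: permutation loops over candidates, so its cost grows with the number of candidate pairs, while elimination loops over trials and can stop as soon as only one candidate remains.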