
Can LLMs Perceive Time? An Empirical Investigation

Aniketh Garikaparthi

Abstract

Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4--7$\times$ ($p < 0.001$), with models predicting human-scale minutes for tasks completing in seconds. Relative ordering fares no better: on task pairs designed to expose heuristic reliance, models score at or below chance (GPT-5: 18\% on counter-intuitive pairs, $p = 0.033$), systematically failing when complexity labels mislead. Post-hoc recall is disconnected from reality -- estimates diverge from actuals by an order of magnitude in either direction. These failures persist in multi-step agentic settings, with errors of 5--10$\times$. The models possess propositional knowledge about duration from training but lack experiential grounding in their own inference time, with practical implications for agent scheduling, planning and time-critical scenarios.
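Because the counter-intuitive-pair statistic rests on only 11 pairs, the arithmetic is easy to check: 18\% of 11 is 2 correct answers, and a one-sided exact binomial test against the 50\% chance level gives $p = 67/2048 \approx 0.033$, matching the reported value. A minimal verification in pure Python (the counts come from the abstract; the script itself is only illustrative):

```python
from math import comb

# Reported result: GPT-5 gets 2 of 11 counter-intuitive pairs right (~18%).
n, k, chance = 11, 2, 0.5

# One-sided exact binomial test: probability of k or fewer correct answers
# under the null hypothesis that ordering accuracy equals chance.
# With chance = 0.5, each outcome has probability 0.5**n.
p_value = sum(comb(n, i) for i in range(k + 1)) * chance**n

print(f"{k}/{n} correct = {k/n:.0%}, one-sided p = {p_value:.3f}")
# -> 2/11 correct = 18%, one-sided p = 0.033
```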


Figures (5)

  • Figure 1: Estimation calibration across models. Each point represents one task; dashed line indicates perfect calibration. Frontier models (GPT-5, GPT-4o) show weak positive correlation, while open models (OLMo, Qwen) cluster around arbitrary values with no relationship to actual duration. (A sketch of computing such calibration metrics follows this list.)
  • Figure 2: Relative ordering accuracy on 26 hard pairs (left) and by pair type (right). Counter-intuitive pairs expose heuristic reliance---GPT-5 scores 18% on 11 CI pairs ($p = 0.033$, one-sided binomial), significantly below chance. Overall accuracy ranges from 46--58%, consistent with near-random performance on diagnostically hard pairs.
  • Figure 3: GPT-5 calibration by reasoning effort level. Higher reasoning effort reduces the overestimation ratio (left): actual durations grow toward the models' human-scale estimates. Correlation improves slightly (right) but remains driven by task complexity, not self-awareness.
  • Figure 4: Qwen3-8B calibration with and without thinking mode. Thinking improves correlation ($r = 0.44$ vs $r = 0.18$) but increases underestimation as the model fails to account for reasoning overhead.
  • Figure 5: Agentic task estimation errors. Pre-task estimates (left) overshoot by 5--10$\times$. Post-hoc estimates (right) show even larger disconnection from actual duration, with GPT-4o's failed tasks producing extreme underestimates.
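For readers who want to reproduce the calibration metrics of Figures 1, 3, and 4 on their own data, the sketch below shows one way to compute them from paired (estimated, actual) durations in seconds. The sample `pairs` list is hypothetical placeholder data, not the paper's measurements, and the geometric-mean summary is a reasonable convention for multiplicative errors rather than the paper's confirmed method:

```python
import math
from statistics import correlation, geometric_mean  # correlation needs Python 3.10+

# Hypothetical (estimate_seconds, actual_seconds) pairs per task;
# the paper's real data is not reproduced here.
pairs = [(120.0, 18.0), (300.0, 45.0), (60.0, 14.0), (600.0, 95.0)]

est, act = zip(*pairs)

# Overestimation factor per task, summarized by its geometric mean
# (multiplicative errors like 4-7x are best averaged on a log scale).
overshoot = geometric_mean(e / a for e, a in pairs)

# Calibration: Pearson r between log-estimates and log-actuals, as in the
# Figure 1 scatter; the log scale keeps order-of-magnitude errors comparable.
r = correlation([math.log(e) for e in est], [math.log(a) for a in act])

print(f"geometric-mean overshoot = {overshoot:.1f}x, log-log r = {r:.2f}")
```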