Empirical Characterization of Temporal Constraint Processing in LLMs

Javier Marín

TL;DR

This paper addresses how production-scale LLMs handle temporal constraints in time-critical tasks by evaluating eight models (2.8–8B) on eight deadline-detection scenarios under two prompt formats. It reveals a bimodal accuracy distribution, extreme prompt brittleness, and a systematic action bias, indicating that temporal constraint satisfaction is not reliably learned through next-token prediction alone. Although fine-tuning with 200 diverse examples yields 12–37 percentage-point gains for partially capable models, scale does not predict capability, and robust temporal reasoning appears to require architectural mechanisms for a continuous temporal state, explicit constraint checking, and compositional temporal reasoning. The authors argue for explicit temporal constraint testing before deployment and propose hybrid architectures integrating symbolic reasoning to mitigate risk in time-sensitive applications.

Abstract

When deploying LLMs in agentic architectures requiring real-time decisions under temporal constraints, we assume they reliably determine whether action windows remain open or have closed. This assumption is untested. We characterize temporal constraint processing across eight production-scale models (2.8-8B parameters) using deadline detection tasks, revealing systematic deployment risks: a bimodal performance distribution (models achieve either 95% or 50% accuracy), extreme prompt brittleness (30-60 percentage-point swings from formatting changes alone), and a systematic action bias (100% false positive rates in failing models). Parameter count shows no correlation with capability in this range: a 3.8B model matches 7B models, while other 7B models fail completely. Fine-tuning on 200 synthetic examples improves models with partial capability by 12-37 percentage points. We demonstrate that temporal constraint satisfaction cannot be reliably learned through next-token prediction on natural language, even with targeted fine-tuning. This capability requires architectural mechanisms for (1) continuous temporal state representation, (2) explicit constraint checking separate from linguistic pattern matching, and (3) systematic compositional reasoning over temporal relations. Current autoregressive architectures lack these mechanisms. Deploying such systems in time-critical applications without hybrid architectures incorporating symbolic reasoning modules represents unacceptable risk.
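To make "explicit constraint checking separate from linguistic pattern matching" concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a deadline is a single timestamp and that the window counts as open up to and including that timestamp. The names `window_open`, `gated_decision`, and `false_positive_rate` are hypothetical and chosen only for this example.

```python
from datetime import datetime


def window_open(now: datetime, deadline: datetime) -> bool:
    """Symbolic check: the action window is open iff `now` is not past `deadline`.

    Treating the deadline itself as still open is an assumption of this sketch.
    """
    return now <= deadline


def gated_decision(model_says_act: bool, now: datetime, deadline: datetime) -> bool:
    """Permit an action only if the model proposes it AND the symbolic check passes."""
    return model_says_act and window_open(now, deadline)


def false_positive_rate(predictions, labels):
    """Share of closed-window cases (label False) in which the prediction was to act."""
    negatives = [p for p, y in zip(predictions, labels) if not y]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)


if __name__ == "__main__":
    deadline = datetime(2025, 1, 1, 12, 0)
    # Hypothetical evaluation times: two before the deadline, two after.
    times = [
        datetime(2025, 1, 1, 11, 0),
        datetime(2025, 1, 1, 13, 0),
        datetime(2025, 1, 1, 11, 59),
        datetime(2025, 1, 1, 12, 1),
    ]
    labels = [window_open(t, deadline) for t in times]

    # A stand-in for a model with action bias: it always proposes acting.
    always_act = [True] * len(times)
    gated = [gated_decision(p, t, deadline) for p, t in zip(always_act, times)]

    print("ungated FPR:", false_positive_rate(always_act, labels))
    print("gated FPR:", false_positive_rate(gated, labels))
```

In this toy setup, a model that always proposes acting is wrong on every closed-window case, while the gated version never acts after the deadline. The sketch covers only the simplest single-deadline case and says nothing about compositional temporal relations.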

Paper Structure

This paper contains 22 sections, 3 tables.