Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments
Maral Doctorarastoo, Katherine A. Flanigan, Mario Bergés, Christopher McComb
TL;DR
The paper addresses forecasting human activities and their durations in smart environments when labeled data is limited. It proposes a retrieval-augmented prompting framework for large language models (LLMs) that integrates temporal, spatial, behavioral-history, and persona context, and evaluates it on the CASAS Aruba dataset with next-activity and multi-step rollout tasks. Key findings: LLMs show strong intrinsic temporal understanding even in zero-shot settings; 1–2 demonstrations give the best balance between accuracy and efficiency, with diminishing returns beyond that; and sequence-level dynamic time warping (DTW) indicates coherent temporal alignment relative to baselines. The results suggest that pre-trained language models can serve as effective temporal reasoners for agent-based models and smart-environment simulations, enabling robust behavior modeling with minimal supervision and data.
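To make the sequence-level DTW comparison concrete, the following is a minimal sketch of dynamic time warping over activity-label sequences. The 0/1 mismatch cost and the activity labels are assumptions chosen for illustration; the summary above does not state the cost function or normalization the authors used.

```python
from typing import Sequence


def dtw_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Classic dynamic-programming DTW between two label sequences.

    Assumption: the local cost is 0 when two labels match and 1 otherwise.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if a[i - 1] == b[j - 1] else 1.0
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Illustrative labels only.
predicted = ["Sleeping", "Sleeping", "Eating"]
observed = ["Sleeping", "Eating"]
stretched = dtw_distance(predicted, observed)
swapped = dtw_distance(["Sleeping", "Eating"], ["Sleeping", "Relax"])
```

Because DTW allows one sequence to be stretched against the other, a prediction that repeats an activity is not penalized here, while a substituted activity is.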
Abstract
Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to what people are doing. Existing data-driven agent-based models, ranging from rule-based to deep learning approaches, struggle in low-data environments, which limits their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context (temporal, spatial, behavioral history, and persona) and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with varying numbers of few-shot examples in the prompt. Analyzing few-shot effects reveals how much contextual supervision is needed to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings they produce coherent daily activity predictions, and adding one or two demonstrations further improves duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners that capture both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.
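As an illustration of how the four context sources and a variable number of demonstrations might be combined into one prompt, here is a minimal sketch. Everything in it is hypothetical: the `Demo` record, the `retrieve` ranking (same location first, then closest hour of day), the prompt wording, and the activity labels are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """Hypothetical record of one past activity used as a demonstration."""
    hour: int            # hour of day the activity started (0-23)
    location: str
    activity: str
    duration_min: float


def hour_gap(h1: int, h2: int) -> int:
    """Circular distance between two hours of the day."""
    d = abs(h1 - h2) % 24
    return min(d, 24 - d)


def retrieve(history: List[Demo], hour: int, location: str, k: int) -> List[Demo]:
    """Assumed retrieval rule: prefer the same location, then the nearest hour."""
    ranked = sorted(
        history,
        key=lambda e: (e.location != location, hour_gap(e.hour, hour)),
    )
    return ranked[:k]  # k = 0 yields a zero-shot prompt


def build_prompt(persona: str, hour: int, location: str,
                 recent: List[str], demos: List[Demo]) -> str:
    """Assemble persona, temporal, spatial, and behavioral-history context."""
    recent_text = ", ".join(recent) if recent else "(none)"
    lines = [
        f"Persona: {persona}",
        f"Time: {hour:02d}:00",
        f"Location: {location}",
        f"Recent activities: {recent_text}",
    ]
    for i, d in enumerate(demos, start=1):
        lines.append(
            f"Demo {i}: at {d.hour:02d}:00 in {d.location} -> "
            f"{d.activity} for {d.duration_min:g} min"
        )
    lines.append("Predict the next activity and its duration in minutes.")
    return "\n".join(lines)


# Illustrative history; labels and values are made up.
history = [
    Demo(7, "kitchen", "Meal_Preparation", 15),
    Demo(22, "bedroom", "Sleeping", 420),
    Demo(8, "kitchen", "Eating", 20),
    Demo(23, "kitchen", "Wash_Dishes", 10),
]
two_shot = retrieve(history, hour=7, location="kitchen", k=2)
prompt_two_shot = build_prompt("retired adult living alone", 7, "kitchen",
                               ["Sleeping"], two_shot)
prompt_zero_shot = build_prompt("retired adult living alone", 7, "kitchen",
                                ["Sleeping"], retrieve(history, 7, "kitchen", 0))
```

Varying `k` in this sketch corresponds to the few-shot conditions described above, with `k = 0` giving the zero-shot case.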
