Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments

Maral Doctorarastoo; Katherine A. Flanigan; Mario Bergés; Christopher McComb

TL;DR

The paper tackles the problem of forecasting human activities and durations in smart environments under limited labeled data. It proposes a retrieval-augmented prompting framework for large language models that integrates temporal, spatial, behavioral history, and persona context, evaluated on the CASAS Aruba dataset with next-activity and multi-step rollout tasks. Key findings show strong intrinsic temporal understanding in LLMs even in zero-shot settings, with 1–2 demonstrations providing the best balance between accuracy and efficiency and diminishing returns beyond that; sequence-level DTW indicates coherent temporal alignment relative to baselines. The results suggest that pre-trained language models can serve as effective temporal reasoners for agent-based models and smart-environment simulations, enabling robust behavior modeling with minimal supervision and data requirements.

Abstract

Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models, ranging from rule-based to deep-learning approaches, struggle in low-data environments, which limits their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context (temporal, spatial, behavioral history, and persona) and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with varying numbers of few-shot examples in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.
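
The sequence-level evaluation mentioned above relies on dynamic time warping (DTW), according to the TL;DR. The summary does not give the paper's exact cost function, so the following is only a minimal sketch of classic DTW over activity-label sequences, assuming a 0/1 mismatch cost between labels. The function name `dtw_distance` and the example sequences are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def dtw_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Classic DTW between two label sequences.

    Assumption (not from the paper): the local cost is 0 when two
    activity labels match and 1 otherwise.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if a[i - 1] == b[j - 1] else 1.0
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays on the same element
                cost[i][j - 1],      # b advances, a stays on the same element
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Illustrative daily sequences (hypothetical data).
observed = ["Sleeping", "Meal_Preparation", "Eating", "Relax"]
predicted = ["Sleeping", "Sleeping", "Meal_Preparation", "Eating", "Relax"]
shifted = ["Sleeping", "Eating", "Relax", "Relax"]

print(dtw_distance(observed, predicted))
print(dtw_distance(observed, shifted))
```

Under this cost, a prediction that only stretches or compresses an activity (such as the repeated `Sleeping` entry) incurs no penalty, while a skipped or substituted activity does. This is why DTW is a natural fit for comparing generated daily routines whose step counts differ from the ground truth.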

Paper Structure (24 sections, 3 equations, 9 figures, 1 table)

Introduction
Methodology
Prompting Framework
Evaluation Metrics
Case Study: CASAS Aruba
Dataset Overview
Experimental Setup on Aruba
Model and Implementation Details
Results
Next-Activity Prediction
Duration Prediction Quality
Joint Activity–Duration Performance
Interpretation
Effect on Class Balance and Generalization
Rapid Early Gain
...and 9 more sections

Figures (9)

Figure 1: Conceptual illustration of PT and D contribution to model performance under different data regimes.
Figure 2: Overview of the prompting framework, comprising three interconnected stages: retrieval pipeline, prompt construction, and LLM inference. Arrows indicate data flow and feedback connections between modules.
Figure 3: CASAS Aruba floor plan and sensor layout.
Figure 4: Median duration of activities across days of the week.
Figure 5: Activity transition matrices for weekday and weekend routines. Color intensity indicates the probability of transitioning from the activity on the Y-axis (current activity) to the activity on the X-axis (next activity).
...and 4 more figures
