SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee, Yohan Jo

TL;DR

SimuHome introduces a Matter-aligned, time-accelerated smart home simulator to realistically evaluate LLM agents on complex, multi-device tasks. It combines a 600-episode benchmark across 12 query types with simulator-based and LLM-judge evaluations to reveal strengths and bottlenecks, especially in latent intent inference and temporal scheduling. Findings show moderate gains for reasoning-enabled models at substantial latency costs, underscoring a crucial task-performance vs. practicality trade-off for real-time smart-home applications. The work positions SimuHome as a vital platform for developing state-aware, action-grounded agents with transferable capabilities to real Matter-compatible devices.

Abstract

Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce SimuHome, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol, the global industry standard for smart home communication, SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 16 agents under a unified ReAct framework reveals distinct capabilities and limitations across models. Models under 7B parameters exhibited negligible performance across all query types. Even GPT-4.1, the best-performing standard model, struggled with implicit intent inference, state verification, and particularly temporal scheduling. While reasoning models such as GPT-5.1 consistently outperformed standard models on every query type, they required over three times the average inference time, which can be prohibitive for real-time smart home applications. This highlights a critical trade-off between task performance and real-world practicality.

Paper Structure

This paper contains 50 sections, 1 equation, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The SimuHome home environment with Matter-compliant devices, featuring a GUI where users can arrange devices across rooms, configure their attributes, and evaluate agent reasoning for multi-device control.
  • Figure 2: Episode Generation Pipeline
  • Figure 3: Episode Evaluation Pipeline
  • Figure 4: Error type distributions of GPT-4.1 on QT2 and QT4.
  • Figure 5: Tool-call error patterns of four models on QT3-F. The left chart shows the average number of errors relative to the average number of tool calls in successful cases. The right chart shows the proportion of tasks achieved through first-try success versus those requiring error recovery.
  • ...and 2 more figures