TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation

Yikai Zhang; Siyu Yuan; Caiyu Hu; Kyle Richardson; Yanghua Xiao; Jiangjie Chen

TL;DR

TimeArena introduces a time-aware textual simulation for evaluating multitasking language agents under realistic temporal and resource constraints. It models action durations, agent and object occupancy, and contention for objects across 30 tasks in cooking, household, and laboratory domains. Performance is measured with four metrics: Average Progress Score ($AS$), Completion Speed ($CS$), Task Completion Rate ($CR$), and Average Completion Time ($CT$), with progress computed as $s_i = (t_i / \sum_{j=1}^{n} t_j) \times 100\%$. Across seven LLMs, including GPT-4, the study finds that humans still outperform agents at parallel processing, which points to a gap in temporal awareness and multitask planning. TimeArena thus serves as a benchmark for developing temporally aware language agents; the paper highlights both the potential benefits of parallelism and the current limitations of state-of-the-art models, and discusses methodological limitations and directions for future work.
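The progress formula above can be illustrated with a minimal sketch. It assumes that $t_i$ denotes the duration of the $i$-th action of a task and $n$ the number of actions (this summary does not define the symbols), and the function name `action_scores` is hypothetical rather than part of TimeArena.

```python
def action_scores(durations):
    """Return s_i = t_i / sum_j t_j * 100 for each action, in percent.

    Assumption: `durations` holds the time cost t_i of each action in one task.
    """
    total = sum(durations)
    if total <= 0:
        raise ValueError("total duration must be positive")
    return [100.0 * t / total for t in durations]


# Hypothetical task with three actions lasting 1, 3, and 4 time steps.
scores = action_scores([1, 3, 4])
print(scores)  # [12.5, 37.5, 50.0]
```

Under this reading, longer actions contribute a larger share of the task's progress, and the shares of one task sum to 100%.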

Abstract

Despite remarkable advancements in emulating human-like behavior through Large Language Models (LLMs), current textual simulations do not adequately address the notion of time. To this end, we introduce TimeArena, a novel textual simulated environment that incorporates complex temporal dynamics and constraints that better reflect real-life planning scenarios. In TimeArena, agents are asked to complete multiple tasks as soon as possible, and may process actions in parallel to save time. We model the dependencies between actions, the duration of each action, and the occupancy of the agent and of the objects in the environment. TimeArena is grounded in 30 real-world tasks in cooking, household activities, and laboratory work. We conduct extensive experiments with various state-of-the-art LLMs using TimeArena. Our findings reveal that even the most powerful models, e.g., GPT-4, still lag behind humans in effective multitasking, underscoring the need for enhanced temporal awareness in the development of language agents.
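To make the three modeled ingredients (action dependencies, action durations, and agent occupancy) concrete, here is a toy sketch. It is not TimeArena's implementation: the `Action` class, the `simulate` function, the greedy ordering, and the rule that starting an unattended action costs the agent one time step are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    duration: int          # time steps the action takes
    occupies_agent: bool   # False: runs unattended once the agent starts it
    deps: tuple = ()       # names of actions that must finish first


def simulate(actions):
    """Greedy toy scheduler; returns the time step at which all actions are done."""
    finish = {}            # action name -> time step at which it completes
    pending = list(actions)
    agent_free_at = 0
    t = 0
    while pending:
        if agent_free_at <= t:
            # Actions whose dependencies have all completed by time t.
            ready = [a for a in pending
                     if all(finish.get(d, float("inf")) <= t for d in a.deps)]
            if ready:
                act = ready[0]
                pending.remove(act)
                finish[act.name] = t + act.duration
                # Assumption: starting an unattended action costs one step.
                agent_free_at = t + (act.duration if act.occupies_agent else 1)
            elif all(f <= t for f in finish.values()):
                raise ValueError("unsatisfiable dependencies")
        t += 1  # if nothing was started, this step is effectively a wait
    return max(finish.values())


# Hypothetical cooking example: boiling and cooking run unattended.
tasks = [
    Action("boil water", 4, occupies_agent=False),
    Action("chop tomato", 2, occupies_agent=True),
    Action("cook pasta", 3, occupies_agent=False, deps=("boil water",)),
]
print(simulate(tasks))                  # 7
print(sum(a.duration for a in tasks))   # 9 if done strictly one after another
```

In this toy run, chopping overlaps with boiling, so the three actions finish in 7 time steps instead of 9, which is the kind of saving from parallel processing that the abstract describes.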

Paper Structure (42 sections, 8 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Simulation-based Evaluation for Language Agents
Language Planning
Temporal Reasoning
TimeArena
Overview of TimeArena
An Example Run
Components of TimeArena
Tasks
Objects
Actions
The Interaction between Agent and Environment
Environmental Feedback
Progress Score
...and 27 more sections

Figures (8)

Figure 1: An example illustrating multitasking with temporal constraints in TimeArena. The completion of tasks requires actions in a predetermined dependency and order. Underlined actions do not occupy the agent, allowing other actions to be processed by the agent simultaneously. The Wait action skips the current time step, meaning the agent is idle.
Figure 2: An overview of TimeArena, with a multitasking example that shows our designs of the simulation. TimeArena first sets an objective for the agent, and then the agent interacts with TimeArena over time, with the design of task dependency, object occupancy, and agent occupancy.
Figure 3: The proportions of correct and incorrect actions of each language agent.
Figure 4: Comparison of the performance of GPT-4 with and without resource constraints. We impose constraints by limiting the environment to a single instance each of pot, fryer, and oven.
Figure 5: Task progress score curves of language agents on two task combinations in TimeArena. The names at the bottom-right indicate the scenario and task number. For example, cooking1 represents the first combination of tasks in the cooking scenario.
...and 3 more figures
