MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments

Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, Peng Wang

TL;DR

MEMTRACK addresses the lack of memory benchmarking in multi-platform, enterprise-like agent environments. It constructs 47 timeline-based scenarios spanning Slack, Linear, and Git to probe long-horizon memory and cross-platform reasoning, using three data-collection methods and three memory configurations. The benchmark introduces Correctness, Efficiency, and Redundancy as core metrics to evaluate memory-centric performance beyond simple QA. Experiments with state-of-the-art LLMs show substantial gaps in multi-platform memory reasoning, with GPT-5 achieving roughly 60% Correctness and memory backends offering limited gains, highlighting the need for more effective memory architectures and retrieval strategies. Overall, MEMTRACK provides an extensible framework for memory-augmented agents that goes beyond conversational benchmarks and enables future multi-agent, multi-platform evaluation in complex organizational workflows.

Abstract

Recent work on context and memory benchmarking has focused primarily on conversational instances, but evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear, and Git. Each benchmark instance provides a chronologically platform-interleaved timeline with noisy, conflicting, cross-referring information, as well as potential codebase/file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution. We curate the MEMTRACK dataset through both manual expert-driven design and scalable agent-based synthesis, generating ecologically valid scenarios grounded in real-world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best-performing GPT-5 model achieves only a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research for memory-augmented agents beyond the existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings.
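To make the idea of a "chronologically platform-interleaved timeline" concrete, the following is a minimal sketch of how per-platform event streams could be merged into one timeline. It is an illustration only, not the MEMTRACK implementation: the `Event` class, the `interleave` function, and the sample events are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Event:
    """Hypothetical event record; only the timestamp takes part in ordering."""
    timestamp: datetime
    platform: str = field(compare=False)
    text: str = field(compare=False)


def interleave(*streams):
    """Merge per-platform streams (each already sorted by time) into one timeline."""
    return list(heapq.merge(*streams))


# Made-up sample events, one stream per platform.
slack = [
    Event(datetime(2025, 1, 1, 9, 0), "slack", "Let's ship the fix today."),
    Event(datetime(2025, 1, 1, 9, 30), "slack", "Correction: ship tomorrow."),
]
linear = [Event(datetime(2025, 1, 1, 9, 10), "linear", "Ticket moved to In Progress.")]
git = [Event(datetime(2025, 1, 1, 9, 20), "git", "Commit: patch null check.")]

timeline = interleave(slack, linear, git)
for event in timeline:
    print(event.timestamp.isoformat(), event.platform, event.text)
```

In this toy timeline the second Slack message contradicts the first, which is the kind of conflicting information an agent would need to resolve when answering questions about the final state.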

Paper Structure

This paper contains 38 sections, 1 figure, and 7 tables.

Figures (1)

  • Figure 1: MEMTRACK's data generation and evaluation procedure is divided into four parts: task generation, injection of events into a containerized environment, agent execution monitoring with sequential question injection, and performance evaluation.