A Scalable Benchmark for Repository-Oriented Long-Horizon Conversational Context Management

Yang Liu; Li Zhang; Fang Liu; Ping Lin; Xinyi Li

TL;DR

This paper presents LoCoEval, the first long-horizon conversational context management benchmark tailored to repository-oriented development scenarios, together with an improved method that integrates conversational and repository information into a unified memory. The method outperforms all baselines (*Oracle* excluded) and demonstrates robustness.

Abstract

In recent years, large language models (LLMs) have advanced rapidly, substantially enhancing their code understanding and generation capabilities and giving rise to powerful code assistants. However, in practical repository development, an excessively long conversational context may overwhelm models, causing the loss of critical information and degraded performance, thereby limiting the utility of code assistants. Existing context management methods proposed to mitigate this problem primarily target general-purpose conversations, while repository-oriented solutions remain largely unexplored, mainly because reliable evaluation benchmarks are lacking. To bridge this gap, we present LoCoEval, the first long-horizon conversational context management benchmark tailored to repository-oriented development scenarios. Adhering to three key principles, LoCoEval is constructed via an LLM-driven pipeline that generates realistic and diverse repository-oriented conversations, capturing key interaction patterns such as iterative requirements, noisy input, and retrospective questions. We evaluate 7 baselines, including 4 representative context management methods, using 3 advanced backbone LLMs on LoCoEval. The results reveal substantial challenges faced by standalone LLMs and existing approaches, especially memory systems, in repository-oriented conversational scenarios. To address these limitations, we further propose an improved method that integrates conversational and repository information into a unified memory; it outperforms all baselines (*Oracle* excluded) and demonstrates robustness. Additionally, we investigate the impact of various factors on method performance, providing actionable insights for future research.

Paper Structure (30 sections, 3 equations, 5 figures, 6 tables)

Introduction
Related Work
Conversational Context Management for LLMs
Conversational Context Management Benchmark
Repository-Oriented Benchmark
Approach
Automated Construction of LoCoEval
Sample Selection
Information Items Extraction and Mutation
Query Outline Skeleton Construction
Query Outline Population
LoCoEval Benchmark
Distribution Patterns
Tasks and Evaluation Metrics
Evaluation Framework
...and 15 more sections

Figures (5)

Figure 1: An example of LoCoEval. All mock user queries and agent responses are dynamically generated during evaluation.
Figure 2: Overview of the construction pipeline of LoCoEval.
Figure 3: Overview of the evaluation framework of LoCoEval.
Figure 4: Agent workflow of our improved Mem0$^\mathcal{R}$.
Figure 5: Trends of the normalized pass@1 on the function generation task for different agents, with respect to the number of this task $k$ per sample (left) and the interval of conversation length $l$ per sample (right).
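
Figure 5 reports a normalized pass@1 on the function generation task. This excerpt does not define the normalization, so the sketch below covers only the standard, unnormalized pass@k estimate, which reduces to the fraction of passing samples when k = 1. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate for one task.

    n: number of generated samples, c: number that pass the tests,
    k: sampling budget. For k = 1 this equals c / n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With one generation per task, `mean_pass_at_k` is simply the share of tasks whose single output passes. Any normalization applied in the paper would be a further step on top of this value.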

Theorems & Definitions (9)

Definition 3.1: ground-truth information item
Definition 3.2: distracting information item
Definition 3.3: information item unit
Definition 3.4: prerequisite relation
Definition 3.5: information item dependency graph
Definition 3.6: query item
Definition 3.7: topic
Definition 3.8: query outline
Definition 3.9: recap query item
