Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev

TL;DR

The study probes the effectiveness of repository-level context files (e.g., AGENTS.md) for coding agents by introducing AGENTbench and comparing three settings: no context, LLM-generated context, and developer-provided context. Across four agents and two benchmarks, LLM-generated context tends to raise inference costs and slightly reduce success, while human-written context offers marginal performance gains at additional cost; context files rarely deliver useful repository overviews. Trace analyses show that the presence of a context file increases testing, exploration, and tool use, indicating higher cognitive load. The authors advocate minimal, task-focused guidance and identify opportunities to develop concise, automatic context-generation methods that better align with real-world task resolution. Overall, the work reveals a gap between practitioner guidance to include context files and empirical evidence of their limited practical value in their current form.

Abstract

A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents' task completion performance in two complementary settings: established SWE-bench tasks from popular repositories, with LLM-generated context files following agent-developer recommendations, and a novel collection of issues from repositories containing developer-committed context files. Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.

Paper Structure

This paper contains 54 sections, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Overview of our evaluation pipeline. We begin with real-world repositories and tasks derived from past pull requests. For each repository state, we generate three settings: in ①, if a developer-provided context file exists, we include it in the repository; in ②, we omit the context file; in ③, we use the coding agent's recommended settings to generate the context file. We then pass the repository and context file to the coding agent and instruct it to autonomously resolve the task. Finally, we analyze the trace for behavioral changes and apply the generated patch to check for task resolution success.
  • Figure 2: Distribution of AGENTbench instances across 12 open-source GitHub repositories, each containing context files.
  • Figure 3: Resolution rate for 4 different models, without context files, with LLM-generated context files, and with developer-written context files, on SWE-bench Lite (left) and AGENTbench (right).
  • Figure 4: Number of steps before the first interaction between the agent and a file included in the PR patch (lower is better) is generally lower without context files than with LLM-generated context files or with developer-written context files (Human) on SWE-bench Lite (left) and AGENTbench (right).
  • Figure 5: When removing all documentation-related files from the codebase, LLM-generated context files tend to outperform developer-provided (Human) ones on AGENTbench.
  • ...and 7 more figures