Gistify! Codebase-Level Understanding via Runtime Execution

Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan, Lucas Caccia

TL;DR

Gistify addresses the challenge of codebase-level understanding by requiring a coding LLM to produce a single, self-contained gistified file that reproduces the runtime behavior of a given command on a large codebase. The authors formalize the task with four requirements—Self-Contained, Execution Fidelity, Minimalism, and Grounded Preservation—and establish an evaluation protocol and three metrics (Execution Fidelity, Line Existence Rate, Line Execution Rate), plus a Test $F_1$ Score for preserving the test function. Through experiments across multiple frameworks, models, and repositories, results show that even frontier models struggle to reliably solve Gistify, though strategies that provide global code context and runtime information yield measurable gains; agentic models with multi-step trajectories generally outperform static approaches. The work highlights the potential of gistified files as compact, executable artifacts for debugging and refactoring, while establishing Gistify as a challenging, generalizable benchmark that scales with repository complexity. Overall, the study demonstrates that runtime-aware reasoning and careful preservation of tests are crucial, and that leveraging global context and execution feedback can meaningfully improve codebase-level understanding in practice.

Abstract

As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a Python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
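
The core requirement in the abstract, that the generated file must replicate the output of the original command, can be illustrated with a small check. The following is a minimal sketch, not the paper's evaluation protocol: the helper names `run_command` and `same_behavior` are hypothetical, and treating "same behavior" as identical exit code and standard output is an assumption made for illustration.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def run_command(argv: Sequence[str], cwd: Optional[str] = None,
                timeout: float = 60.0) -> Tuple[int, str]:
    """Run a command and return (exit code, captured stdout)."""
    result = subprocess.run(
        list(argv), cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout


def same_behavior(original_argv: Sequence[str],
                  gistified_argv: Sequence[str],
                  original_cwd: Optional[str] = None,
                  gistified_cwd: Optional[str] = None) -> bool:
    """Assumed criterion: both runs yield the same exit code and stdout.

    original_argv is the entrypoint command run inside the full codebase;
    gistified_argv is the equivalent command run on the single generated file.
    """
    return (run_command(original_argv, cwd=original_cwd)
            == run_command(gistified_argv, cwd=gistified_cwd))


if __name__ == "__main__":
    # Stand-ins for "command on the codebase" and "command on the gistified file".
    original = [sys.executable, "-c", "print(sum(range(5)))"]
    gistified = [sys.executable, "-c", "print(10)"]
    print(same_behavior(original, gistified))
```

In practice the two commands would point at the repository entrypoint and at the generated file, and a stricter comparison (for example, per-test outcomes rather than raw stdout) may be needed when output contains timestamps or paths.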

Paper Structure

This paper contains 48 sections, 3 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: The Gistify task: given a codebase and an entrypoint command, the goal is to generate a minimal, self-contained gistified code file that faithfully reproduces the original runtime behavior using code from the given codebase.
  • Figure 2: Difficulty of the Gistify task as a function of the execution trace difficulty of the underlying test.
  • Figure 3: Performance of a static coding LLM and various agentic coding LLMs (mini-SWE-Agent, SWE-Agent, Copilot).
  • Figure 4: Base Prompt Template for Gistify Task.
  • Figure 5: Example of a model generating a parameter-specific gistified file when given a command that includes a parameter.
  • ...and 13 more figures