Gistify! Codebase-Level Understanding via Runtime Execution
Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan, Lucas Caccia
TL;DR
Gistify tests codebase-level understanding by requiring a coding LLM to produce a single, self-contained "gistified" file that reproduces the runtime behavior of a given command on a large codebase. The authors formalize the task with four requirements (Self-Contained, Execution Fidelity, Minimalism, and Grounded Preservation) and define an evaluation protocol with three metrics (Execution Fidelity, Line Existence Rate, and Line Execution Rate), plus a Test $F_1$ Score for preservation of the test function. Experiments across multiple frameworks, models, and repositories show that even frontier models struggle to solve Gistify reliably. Strategies that provide global code context and runtime information yield measurable gains, and agentic models with multi-step trajectories generally outperform static approaches. The work highlights the potential of gistified files as compact, executable artifacts for debugging and refactoring, and establishes Gistify as a challenging, generalizable benchmark that scales with repository complexity. Overall, the study indicates that runtime-aware reasoning and careful preservation of tests are crucial to the task.
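To make the Line Existence Rate metric more concrete, the sketch below computes one plausible version of it. The summary above names the metric but does not define it, so the definition used here is an assumption: the fraction of non-blank lines in the gistified file that also appear, after stripping surrounding whitespace, somewhere in the original codebase. The function name `line_existence_rate` and its arguments are hypothetical, not the authors' implementation.

```python
def line_existence_rate(gist_source, codebase_sources):
    """Fraction of non-blank gist lines that also occur in the codebase.

    Assumed definition (not taken from the paper): lines are compared
    after stripping leading and trailing whitespace, and blank lines
    are ignored on both sides.
    """
    # Collect every non-blank, whitespace-stripped line of the codebase.
    codebase_lines = {
        line.strip()
        for source in codebase_sources
        for line in source.splitlines()
        if line.strip()
    }
    gist_lines = [
        line.strip() for line in gist_source.splitlines() if line.strip()
    ]
    if not gist_lines:
        return 0.0  # an empty gist preserves nothing
    kept = sum(1 for line in gist_lines if line in codebase_lines)
    return kept / len(gist_lines)


if __name__ == "__main__":
    codebase = ["import os\n\ndef f(x):\n    return x + 1\n"]
    gist = "def f(x):\n    return x + 1\n\nprint(f(1))\n"
    # Two of the three non-blank gist lines come from the codebase.
    print(line_existence_rate(gist, codebase))
```

A score near 1 under this reading would mean the gistified file is built almost entirely from code copied out of the repository rather than newly written code, which is what the Grounded Preservation requirement appears to be asking for.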
Abstract
As coding agents are increasingly deployed in large codebases, automatically designing challenging, codebase-level evaluations becomes central. We propose Gistify, a task in which a coding LLM must create a single, minimal, self-contained file that reproduces a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a Python command), and the generated file must replicate the output of the same command run under the full codebase while containing only the components essential to executing that command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially those with long execution traces.
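The core check described in the abstract, that the gistified file replicates the output of the original command, can be sketched as follows. This is a minimal illustration rather than the authors' evaluation harness: the helper names `run_command` and `reproduces_output` are hypothetical, and comparing exit code and standard output exactly is an assumed notion of "same output".

```python
import subprocess
import sys


def run_command(argv, cwd=None, timeout=60):
    """Run a command and return its (exit code, stdout) pair."""
    proc = subprocess.run(
        argv, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout


def reproduces_output(original_argv, gist_argv, cwd=None):
    """True if both commands exit with the same code and print the same stdout.

    Assumption: exact equality is used here for simplicity; a real harness
    would likely normalize timings, paths, and other unstable output.
    """
    return run_command(original_argv, cwd) == run_command(gist_argv, cwd)


if __name__ == "__main__":
    # Stand-ins for "command on the full codebase" and "command on the gist".
    original = [sys.executable, "-c", "import math; print(math.sqrt(16))"]
    gist = [sys.executable, "-c", "print(4.0)"]
    print(reproduces_output(original, gist))
```

In practice the two argument lists would be the entrypoint run against the full repository and the same entrypoint redirected to the single gistified file, each in an isolated working directory.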
