Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding

Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel, Yousuf Zaii

TL;DR

Chronos-1 is a debugging-focused language model for repository-scale code understanding that combines Persistent Debug Memory (PDM), Adaptive Graph-Guided Retrieval (AGR), and a seven-layer autonomous fix-test-refine loop. It reports state-of-the-art results on debugging benchmarks, notably 80.33% on SWE-bench Lite and 67.3% fix accuracy across 5,000 real-world bugs, which the authors attribute to persistent memory, multi-hop context retrieval, and execution-feedback loops. The work includes ablations, theoretical guarantees, and adversarial analyses, and argues that a specialized design bridging memory, reasoning, and automated testing can outperform general frontier models on debugging tasks. The architecture targets autonomous maintenance within CI/CD and IDE ecosystems, with availability planned in Kodezi OS and via API in 2025–2026. The paper also outlines limitations and future work, including hardware-dependent and cross-language bugs, safety considerations, and broader adoption in production environments.

Abstract

Large Language Models (LLMs) have advanced code generation and software automation but remain constrained by inference-time context and lack structured reasoning over code, leaving debugging largely unsolved. While Claude 4.5 Opus achieves 74.40% on SWE-bench Verified and Gemini 3 Pro reaches 76.2%, both models remain below 20% on real multi-file debugging tasks. We introduce Kodezi Chronos-1, a language model purpose-built for debugging that integrates Adaptive Graph-Guided Retrieval to navigate codebases up to 10 million lines (92% precision, 85% recall), Persistent Debug Memory trained on over 15 million sessions, and a seven-layer fix-test-refine architecture. On 5,000 real-world scenarios, Chronos-1 achieves 67.3% ± 2.1% fix accuracy compared to 14.2% ± 1.3% for Claude 4.1 Opus and 13.8% ± 1.2% for GPT-4.1 (Cohen's d = 3.87). On SWE-bench Lite, Chronos-1 reaches a state-of-the-art 80.33% resolution rate (241 of 300), outperforming the next best system by 20 points and achieving repository-specific highs of 96.1% on SymPy and 90.4% on Django. Chronos-1 reduces debugging time by 40% and iterations by 65%, resolving complex multi-file and cross-repository bugs that require temporal analysis. Limitations remain for hardware-dependent and dynamic language errors. Chronos-1 will be available in Kodezi OS in Q4 2025 and via API in Q1 2026.

Paper Structure

This paper contains 117 sections, 3 theorems, 2 equations, 23 figures, 44 tables, 2 algorithms.

Key Result

Theorem 1

The time complexity of AGR is $O(k_{max} \cdot |S| \cdot d^{k_{max}} \cdot \log(d^{k_{max}}))$ where $|S|$ is the number of seed nodes and $k_{max}$ is the maximum hop depth.
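To make the shape of this bound concrete, the following is a minimal sketch and not the paper's AGR implementation. It assumes that $d$ denotes the branching factor (maximum out-degree) of the code graph, which the statement above does not define, and it uses a base-2 logarithm for illustration. The function names `agr_cost_bound` and `k_hop_neighborhood` are hypothetical. The first evaluates the expression from Theorem 1 for given parameters; the second is a plain hop-bounded breadth-first expansion, which shows why a frontier of up to $d^{k_{max}}$ nodes appears per seed.

```python
import math


def agr_cost_bound(num_seeds, degree, k_max):
    """Evaluate k_max * |S| * d^k_max * log2(d^k_max).

    Assumption: `degree` plays the role of d (branching factor) and the
    logarithm is base 2; the theorem statement fixes neither.
    """
    frontier = degree ** k_max
    return k_max * num_seeds * frontier * math.log2(frontier)


def k_hop_neighborhood(graph, seeds, k_max):
    """Return {node: hop distance} for nodes within k_max hops of any seed.

    `graph` is an adjacency dict {node: [neighbors]}. This is an ordinary
    hop-bounded BFS used for illustration, not Chronos-1's retrieval.
    """
    dist = {seed: 0 for seed in seeds}
    frontier = list(seeds)
    for hop in range(1, k_max + 1):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in dist:
                    dist[neighbor] = hop
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return dist


if __name__ == "__main__":
    # Toy graph: nodes stand in for code elements, edges for relationships.
    toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
    print(k_hop_neighborhood(toy_graph, ["a"], k_max=2))
    print(agr_cost_bound(num_seeds=2, degree=2, k_max=3))
```

Because the frontier grows as $d^{k_{max}}$, the bound is exponential in the hop depth, so under these assumptions a small increase in $k_{max}$ costs far more than adding seed nodes.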

Figures (23)

  • Figure 1: Token distribution in debugging tasks: Unlike typical LLM applications where input dominates, debugging requires substantial, high-quality output generation.
  • Figure 2: Debugging capability comparison across eight key factors. Chronos-1 (green) significantly outperforms all frontier models including Claude 4.5 Opus (red), Gemini 3 Pro (cyan), Claude 4.1 Opus (blue), and GPT-4.1 (orange). Despite improvements in newer models, Chronos-1 maintains substantial advantages in test integration (92%), iteration speed (95%), and cost efficiency (91%).
  • Figure 3: Complete fix loop lifecycle showing integration between PDM, AGR retrieval, and iterative refinement. Dashed lines indicate feedback mechanisms that enable learning across debugging sessions.
  • Figure 4: High-level overview of Chronos-1: Memory-driven embedding and retrieval powering autonomous reasoning and codebase management.
  • Figure 5: Graph-structured memory indexing in Kodezi Chronos-1: code, documentation, and test elements as nodes, with functional relationships as edges.
  • ...and 18 more figures

Theorems & Definitions (5)

  • Theorem 1: AGR Retrieval Complexity
  • proof
  • Theorem 2: Confidence Convergence
  • proof
  • Lemma 1: Path Cost Bound