CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

Anh Nguyen Hoang, Minh Le-Anh, Bach Le, Nghi D. Q. Bui

TL;DR

CodeWiki generates holistic, architecture-aware, repository-level documentation for large, evolving codebases. It combines hierarchical decomposition, a recursive, semi-agentic generation process with dynamic task delegation, and multi-modal synthesis that produces both textual and architectural artifacts; the output is evaluated with CodeWikiBench. With proprietary models, CodeWiki scores 68.79% against 64.06% for the closed-source DeepWiki baseline, a gain of 4.73 percentage points, and it performs well across languages and repository sizes, particularly for high-level languages. By open-sourcing CodeWiki and CodeWikiBench, the work aims to accelerate adoption and further research in automated, architecture-aware software documentation.

Abstract

Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06%) by 4.73 percentage points, with particularly strong improvements on high-level scripting languages (+10.47 percentage points). We open-source CodeWiki to foster future research and community adoption.
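
As a rough illustration of innovations (i) and (ii), the sketch below walks a module tree bottom-up: leaf modules are documented first, and each parent is then synthesized from its children's documents. This is a minimal sketch and not CodeWiki's implementation. The `Module` class, the `generate_docs` function, the stub `write_leaf` and `synthesize` callbacks, and the size threshold `max_components` that triggers delegation are all hypothetical names and rules chosen for illustration; in the paper, delegation is decided by the agents themselves.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Module:
    """A node in the module tree (hypothetical structure for illustration)."""
    name: str
    components: List[str] = field(default_factory=list)   # source units of a leaf
    children: List["Module"] = field(default_factory=list)


def generate_docs(
    module: Module,
    write_leaf: Callable[[str, List[str]], str],
    synthesize: Callable[[str, List[str]], str],
    max_components: int = 3,
    docs: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Document leaves first, then build each parent from its children's docs."""
    docs = {} if docs is None else docs

    # Assumed delegation rule: an oversized leaf is split into sub-modules,
    # so the module tree grows while it is being processed.
    if not module.children and len(module.components) > max_components:
        chunks = [module.components[i:i + max_components]
                  for i in range(0, len(module.components), max_components)]
        module.children = [Module(f"{module.name}.part{n}", chunk)
                           for n, chunk in enumerate(chunks, start=1)]
        module.components = []

    if module.children:
        for child in module.children:
            generate_docs(child, write_leaf, synthesize, max_components, docs)
        child_docs = [docs[child.name] for child in module.children]
        docs[module.name] = synthesize(module.name, child_docs)
    else:
        docs[module.name] = write_leaf(module.name, module.components)
    return docs


# Stub callbacks standing in for LLM calls (assumptions, not real agents).
def write_leaf(name: str, components: List[str]) -> str:
    return f"# {name}\n" + "\n".join(f"- {c}" for c in components)


def synthesize(name: str, child_docs: List[str]) -> str:
    return f"# {name}\n(overview of {len(child_docs)} child documents)"


repo = Module("repo", children=[
    Module("core", ["a.py", "b.py", "c.py", "d.py"]),  # exceeds the threshold
    Module("cli", ["main.py"]),
])
docs = generate_docs(repo, write_leaf, synthesize)
print(list(docs))
```

In this toy run the `core` leaf exceeds the threshold and is split into two sub-modules before being documented, so the repository overview is written last, after every module beneath it.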

Paper Structure

This paper contains 52 sections, 11 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: CodeWiki Framework Architecture Overview. The framework operates in three main phases: (1) Repository analysis through AST/LLM parsing to construct dependency graphs and identify high-level components, followed by hierarchical decomposition into manageable modules; (2) Recursive documentation generation where specialized agents process leaf modules with dynamic delegation capabilities, creating markdown documentation while maintaining cross-module references; (3) Hierarchical assembly where parent modules are synthesized from child documentation using LLM-based synthesis, culminating in comprehensive repository overview documentation. The module tree evolves throughout the process, enabling scalable processing of repositories of arbitrary size.
  • Figure 2: Example hierarchical evaluation rubric for the RAGFlow repository. The rubric mirrors the project's architectural structure with weighted requirements at multiple levels. Leaf nodes represent specific requirements assessed by the Judge Agent, while parent scores are computed through weighted aggregation. The hierarchical organization ensures comprehensive coverage from high-level architectural components down to specific implementation details.
  • Figure 3: Performance comparison by programming language categories. CodeWiki-sonnet-4 shows strong improvements in high-level scripting languages (+10.47%) and managed/enterprise languages (+4.04%), while systems programming languages present challenges for both systems.
  • Figure 4: Example repository-level documentation generated by CodeWiki for the All-Hands-AI--OpenHands repository. The documentation includes a comprehensive overview, architectural diagrams showing the modular event-driven architecture, and hierarchical navigation of core components including the agent system, implementations, and event processing modules.
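
The weighted aggregation described in the Figure 2 caption can be illustrated with a small sketch. The caption only states that parent scores come from weighted aggregation of their children, so the normalized weighted mean used here is an assumption, as are the `RubricNode` class, the `aggregate` function, and the example weights and scores (they are not taken from the RAGFlow rubric).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One requirement in a hierarchical rubric (hypothetical structure)."""
    name: str
    weight: float = 1.0
    score: Optional[float] = None            # set on leaves only, in [0, 1]
    children: List["RubricNode"] = field(default_factory=list)


def aggregate(node: RubricNode) -> float:
    """Return a leaf's score, or the weighted mean of the children's scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} have no positive weight")
    weighted_sum = sum(child.weight * aggregate(child) for child in node.children)
    return weighted_sum / total_weight       # assumed normalization


# Made-up rubric: one heavily weighted component with two leaf requirements,
# plus one lightly weighted leaf requirement.
root = RubricNode("repository", children=[
    RubricNode("component A", weight=3.0, children=[
        RubricNode("requirement A1", score=1.0),
        RubricNode("requirement A2", score=0.0),
    ]),
    RubricNode("requirement B", weight=1.0, score=1.0),
])
print(aggregate(root))  # (3 * 0.5 + 1 * 1.0) / 4 = 0.625
```

Because the score is computed recursively, a requirement missed deep in a heavily weighted component lowers the repository-level score more than the same miss in a lightly weighted one.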