HCAG: Hierarchical Abstraction and Retrieval-Augmented Generation on Theoretical Repositories with LLMs

Yusen Wu, Xiaotie Deng

Abstract

Existing Retrieval-Augmented Generation (RAG) methods for code struggle to capture the high-level architectural patterns and cross-file dependencies inherent in complex, theory-driven codebases, such as those in algorithmic game theory (AGT), leading to a persistent semantic and structural gap between abstract concepts and executable implementations. To address this challenge, we propose Hierarchical Code/Architecture-guided Agent Generation (HCAG), a framework that reformulates repository-level code generation as a structured, planning-oriented process over hierarchical knowledge. HCAG adopts a two-phase design: an offline hierarchical abstraction phase that recursively parses code repositories and aligned theoretical texts to construct a multi-resolution semantic knowledge base explicitly linking theory, architecture, and implementation; and an online hierarchical retrieval and scaffolded generation phase that performs top-down, level-wise retrieval to guide LLMs in an architecture-then-module generation paradigm. To further improve robustness and consistency, HCAG integrates a multi-agent discussion mechanism inspired by cooperative games. We provide a theoretical analysis showing that hierarchical abstraction with adaptive node compression achieves cost-optimality compared to flat and iterative RAG baselines. Extensive experiments on diverse game-theoretic system generation tasks demonstrate that HCAG substantially outperforms representative repository-level methods in code quality, architectural coherence, and requirement pass rate. In addition, HCAG produces a large-scale, aligned theory-implementation dataset that effectively enhances domain-specific LLMs through post-training. Although demonstrated in AGT, the HCAG paradigm also offers a general blueprint for mining, reusing, and generating complex systems from structured codebases in other domains.

Paper Structure (16 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Phase I Pipeline: Recursive Abstraction & Hierarchical Summarization. For each target codebase, HCAG recursively extracts logical summaries (green and yellow blocks) from the underlying textual content (orange blocks). At each level, the original content (index) and its summary are aggregated upward until the root node is reached. For nodes exceeding a configurable depth threshold, compression is applied (indicated by ellipsis): a placeholder label is generated, which can be expanded on-demand during later retrieval when the node is first accessed.
  • Figure 2: Phase II Pipeline: Hierarchical Retrieval and Scaffolded Generation. Given a user task instruction, the LLM evaluates the relevance of each retrieved node with respect to the task. If a node is fully relevant, its content is returned; if it is irrelevant or a leaf node, retrieval along that branch terminates. More commonly, a node is partially relevant, triggering task decomposition over its sub-nodes and recursive retrieval. Encountering a compressed placeholder node initiates on-demand expansion using the Phase I abstraction procedure to generate a structured summary for the subtree. This recursive process continues (gray boxes denote parent nodes, green boxes denote child nodes with identical internal structure) until the structured codebase abstraction has been sufficiently traversed to support generation.
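
The two pipelines described in the figure captions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names Node, summarize, abstract, relevance, and retrieve are hypothetical, the LLM summarizer and the LLM relevance judgment are replaced by simple token-set stand-ins, and task decomposition is simplified to passing the same task to every sub-node. The sketch only shows the control flow: bottom-up summarization with depth-based compression (Phase I), then top-down retrieval that expands placeholders on first access and stops at irrelevant or leaf nodes (Phase II).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    content: str = ""                      # raw text; only leaves carry it in this toy
    children: List["Node"] = field(default_factory=list)
    summary: Optional[str] = None
    compressed: bool = False


def summarize(texts: List[str]) -> str:
    """Assumed stand-in for an LLM summarizer: the sorted set of tokens."""
    return " ".join(sorted({tok for t in texts for tok in t.lower().split()}))


def abstract(node: Node, depth: int = 0, max_depth: int = 1) -> str:
    """Phase I: summarize bottom-up; compress nodes deeper than max_depth."""
    if depth > max_depth:
        node.compressed = True
        node.summary = node.name           # cheap placeholder label, no summarizer call
        return node.summary
    node.compressed = False
    if node.children:
        parts = [abstract(c, depth + 1, max_depth) for c in node.children]
    else:
        parts = [node.content]
    node.summary = summarize(parts)
    return node.summary


def relevance(task: str, node: Node) -> str:
    """Assumed stand-in for the LLM relevance judgment (token overlap)."""
    task_toks = set(task.lower().split())
    node_toks = set((node.summary or "").split())
    if node_toks and node_toks <= task_toks:
        return "full"
    return "partial" if task_toks & node_toks else "none"


def retrieve(task: str, node: Node) -> List[Node]:
    """Phase II: top-down retrieval with on-demand expansion of placeholders."""
    if node.compressed:
        abstract(node, depth=0)            # expand the placeholder on first access
    verdict = relevance(task, node)
    if verdict == "full":
        return [node]                      # fully relevant: return this node
    if verdict == "none" or not node.children:
        return []                          # irrelevant or leaf: branch terminates
    hits: List[Node] = []
    for child in node.children:            # simplification: same task for each sub-node
        hits.extend(retrieve(task, child))
    return hits


# Toy repository tree (hypothetical module names).
repo = Node("repo", children=[
    Node("auctions", children=[
        Node("vickrey", content="vickrey auction"),
        Node("english", content="english auction"),
    ]),
    Node("matching", children=[Node("gale_shapley", content="stable matching")]),
])
abstract(repo)                             # offline phase; depth-2 nodes are compressed
result = [n.name for n in retrieve("vickrey auction", repo)]
print(result)                              # ['vickrey']
```

In this toy run the "matching" branch is judged irrelevant at its parent, so its compressed child is never expanded. That is the cost saving the compression step is meant to provide, although the sketch says nothing about the formal cost analysis in the paper.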