A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Waleed Khalid, Dmitry Ignatov, Radu Timofte

TL;DR

NN-RAG tackles the fragmentation of PyTorch code across repositories by introducing a retrieval-augmented pipeline that constructs dependency-closed, executable neural modules with provenance. It emphasizes import-preserving regeneration and validator-gated promotion, using neutral specifications to optionally guide LLM-based synthesis without redistributing code. On 19 repositories, it extracts 1,289 blocks and validates 941 (73.0%) as runnable. De-duplication shows that most unique architectures in the LEMUR dataset originate from NN-RAG, and the pipeline enables cross-repository migration of designs. The approach enhances reproducibility, scalability, and transparency in neural-architecture reuse, providing a practical substrate for ablations and architectural discovery while avoiding the redistribution of third-party weights.

Abstract

Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion -- ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework's neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code. Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
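The abstract names three de-duplication levels (exact, lexical, structural) without specifying how each is computed. The following is a minimal sketch of one plausible cascade, not the authors' implementation. It assumes that "exact" means identical source text, "lexical" means identical token streams once comments and whitespace layout are ignored, and "structural" means identical syntax trees once identifiers are renamed in order of first appearance. All function names here are hypothetical, and only the Python standard library is used.

```python
import ast
import hashlib
import io
import tokenize


def _sha(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_key(source):
    """Level 1 (assumed): identical source text shares a key."""
    return _sha(source)


def lexical_key(source):
    """Level 2 (assumed): compare token streams, ignoring comments and layout."""
    markers = {tokenize.INDENT: "<I>", tokenize.DEDENT: "<D>",
               tokenize.NEWLINE: "<N>"}
    skipped = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skipped:
            continue
        # Block structure is kept as a marker; the raw whitespace is dropped.
        parts.append(markers.get(tok.type, tok.string))
    return _sha(" ".join(parts))


class _Canonicalize(ast.NodeTransformer):
    """Rename variables, arguments, functions, and classes to v0, v1, ...

    Attribute names (e.g. the `Conv2d` in `nn.Conv2d`) are left untouched so
    that blocks built from different layers stay distinct.
    """

    def __init__(self):
        self._names = {}

    def _canon(self, name):
        return self._names.setdefault(name, f"v{len(self._names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return self.generic_visit(node)

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        return self.generic_visit(node)

    visit_AsyncFunctionDef = visit_FunctionDef
    visit_ClassDef = visit_FunctionDef


def structural_key(source):
    """Level 3 (assumed): compare syntax trees after identifier renaming."""
    tree = _Canonicalize().visit(ast.parse(source))
    return _sha(ast.dump(tree))


LEVELS = (("exact", exact_key),
          ("lexical", lexical_key),
          ("structural", structural_key))


def deduplicate(blocks):
    """Apply the levels in order; return the surviving block names per level.

    `blocks` maps a block name to its source code. Each level only sees the
    survivors of the previous, stricter level.
    """
    survivors = list(blocks)
    kept = {}
    for level, key_fn in LEVELS:
        seen, unique = set(), []
        for name in survivors:
            key = key_fn(blocks[name])
            if key not in seen:
                seen.add(key)
                unique.append(name)
        survivors = unique
        kept[level] = unique
    return kept
```

Calling `deduplicate({"a": src_a, "b": src_b})` returns a dictionary with one list of surviving names per level, so the drop between consecutive levels shows how many blocks were duplicates at that granularity. One limitation of this sketch: two blocks that differ only in the names of their `self.<attribute>` fields are still treated as structurally distinct.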

Paper Structure

This paper contains 16 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: System architecture showing the five core components—BlockDiscovery, BlockExtractor, FileIndexStore, BlockValidator, and RepoCache—and their data-flow relations.
  • Figure 2: Distribution of extracted neural network blocks across major repositories, demonstrating comprehensive coverage of the PyTorch ecosystem.
  • Figure 3: Seven-phase extraction pipeline from automated block discovery to validation, showing the flow and output split between validated (941) and failed (348) blocks under the current configuration.
  • Figure 4: Extraction and validation statistics showing 100% extraction and 73% validation across 1,289 targets.
  • Figure 5: Processing indicators and generated code volume. Caching and concurrency stabilize iteration time even as the corpus grows.
  • ...and 2 more figures