Free and Customizable Code Documentation with LLMs: A Fine-Tuning Approach

Sayak Chakrabarty, Souradip Pal

TL;DR

This paper addresses the challenge of automatic code documentation for public repositories by introducing a large-language-model-based tool that generates README content and GitHub pages. It leverages a retrieval-augmented generation pipeline with open-source models to reduce API costs and enables fine-tuning via QLoRA for user-specific datasets. The methodology combines code indexing, embedding-based retrieval (HNSW), and LangChain prompt engineering to produce context-aware documentation. While not state-of-the-art, the work demonstrates practical utility for developers by lowering manual effort and enabling easy deployment and integration.

Abstract

Automated documentation of programming source code is a challenging task with significant practical and scientific implications for the developer community. We present a large language model (LLM)-based application that developers can use as a support tool to generate basic documentation for any publicly available repository. Over the last decade, several papers have been written on generating documentation for source code using neural network architectures. With the recent advancements in LLM technology, some open-source applications have been developed to address this problem. However, these applications typically rely on the OpenAI APIs, which incur substantial financial costs, particularly for large repositories. Moreover, none of these open-source applications offers a fine-tuned model or features that let users fine-tune one themselves. Additionally, finding suitable data for fine-tuning is often challenging. Our application, available at https://pypi.org/project/readme-ready/, addresses these issues.

Paper Structure

This paper contains 14 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Input-to-output workflow showing the Retrieval and Generator modules. The Retrieval module uses the HNSW algorithm to build the context for the prompt that is passed to the language model for text generation.
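
The retrieve-then-prompt flow described in Figure 1 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the token-count `embed` function stands in for a learned embedding model, the exact cosine ranking in `retrieve` stands in for the approximate HNSW search, and the helper names, prompt wording, and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): token counts instead of a learned model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Exact nearest-neighbor ranking over all chunks.

    This brute-force scan is a stand-in: an HNSW index would return an
    approximation of the same ranking without comparing against every chunk.
    """
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(task, context_chunks):
    """Place the retrieved chunks into a prompt (hypothetical wording)."""
    context = "\n---\n".join(context_chunks)
    return (
        "Use the code context below to complete the task.\n\n"
        f"Context:\n{context}\n\n"
        f"Task: {task}\n"
    )


# Hypothetical indexed code chunks from a repository.
CHUNKS = [
    "def install(): run pip install to set up the package dependencies",
    "def train(model, data): fit the model on data and return the loss",
    "def load_config(path): read yaml config file from path",
]

top = retrieve("how to install the package", CHUNKS, k=1)
prompt = build_prompt("Write the Installation section of the README.", top)
print(prompt)
```

In this sketch the resulting prompt would then be sent to the generator model. Replacing the brute-force scan with an HNSW index changes only the `retrieve` step, which is why the retrieval and generator modules can be treated as separate components.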