Program Decomposition and Translation with Static Analysis
Ali Reza Ibrahimzada
TL;DR
Large industry-scale software files exceed typical LLM context windows, hindering code-related tasks such as translation. The paper proposes method-level program decomposition via static analysis (CodeQL) to break files into independent method fragments and uses a bottom-up Call Graph–guided translation (with StarCoder) to handle large inputs. Results on 20 Apache projects (~60K methods) show a ~99.5% reduction in out-of-context cases and average fragment context usage of ~5% of a 2K window, enabling processing of large inputs; a qualitative CG-based translation of Apache Commons CLI demonstrates practical feasibility, with all files translated within ~3% context, though translation correctness was not validated. This approach enhances prompt engineering capability for LLMs in large-scale software engineering tasks and points to future work integrating additional decomposition techniques like slicing and broader dependency graphs.
Abstract
The rising popularity of Large Language Models (LLMs) has motivated exploring their use in code-related tasks. Code LLMs with millions of parameters or more are trained on a massive amount of code in different Programming Languages (PLs). Such models are used to automate various Software Engineering (SE) tasks through prompt engineering. However, given the very large size of industry-scale project files, a major limitation of these LLMs is their limited context window size, which raises the question: "Can these LLMs process very large files, and can we effectively perform prompt engineering?" Code translation aims to convert source code from one PL to another. In this work, we assess the effect of method-level program decomposition on the context window of LLMs and investigate how this approach can enable translation of very large files that originally could not be translated due to the out-of-context issue. Our observations from 20 well-known Java projects and approximately 60K methods suggest that method-level program decomposition alleviates the limited context window problem of LLMs by 99.5%. Furthermore, our empirical analysis indicates that with method-level decomposition, each input fragment consumes on average only 5% of the context window, leaving more context space for prompt engineering and the output. Finally, we investigate the effectiveness of a Call Graph (CG) approach for translating very large files under method-level program decomposition.
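To make the bottom-up, CG-guided ordering concrete, the following is a minimal sketch of one way such an order could be computed. It is not the paper's implementation: the function name `bottom_up_order`, the dictionary representation of the call graph, and the method names are all hypothetical, and the sketch assumes the call graph has already been extracted (for example by a static analysis tool). Callees are emitted before their callers via a post-order depth-first traversal, so each method fragment could be translated after the methods it depends on.

```python
from typing import Dict, List, Set


def bottom_up_order(call_graph: Dict[str, List[str]]) -> List[str]:
    """Return methods so that callees appear before their callers.

    call_graph maps each caller to the methods it calls (hypothetical format).
    Recursive cycles are tolerated: a method already visited is skipped.
    """
    visited: Set[str] = set()
    order: List[str] = []

    def visit(method: str) -> None:
        if method in visited:
            return  # already scheduled, or a back-edge in a recursive cycle
        visited.add(method)
        for callee in call_graph.get(method, []):
            visit(callee)
        order.append(method)  # emitted only after all of its callees

    # Sort the roots so the result is deterministic across runs.
    for method in sorted(call_graph):
        visit(method)
    return order


# Hypothetical call graph for illustration only.
call_graph = {
    "main": ["parse"],
    "parse": ["tokenize", "validate"],
    "validate": ["tokenize"],
    "tokenize": [],
}
order = bottom_up_order(call_graph)
print(order)
```

Translating fragments in this order means that, when a caller is translated, its callees have already been processed. The recursive traversal is kept for readability; a very deep call graph would call for an iterative version.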
