CodeMapper: A Language-Agnostic Approach to Mapping Code Regions Across Commits
Huimin Hu, Michael Pradel
TL;DR
CodeMapper addresses the code mapping problem: given a code region in one commit, it finds the corresponding region in another commit, independent of the programming language. In its first phase, it computes candidate regions with three techniques: diff-based candidate extraction, movement detection, and text search. In its second phase, it selects the best target region using a context-aware Levenshtein similarity. The approach is evaluated on four datasets: the hand-annotated Data A and Data B, a dataset from a suppression study, and Java data derived from CodeTracker. It achieves higher exact-match rates than the baselines and performs robustly across languages. Its runtimes are low enough for interactive use, which makes it practical both for developers and for empirical studies that track specific code regions during software evolution.
Abstract
During software evolution, developers commonly face the problem of mapping a specific code region from one commit to another. For example, they may want to determine how the condition of an if-statement, a specific line in a configuration file, or the definition of a function changes. We call this the code mapping problem. Existing techniques, such as git diff, address this problem inadequately because they show all changes made to a file instead of focusing on a code region of the developer's choice. Other techniques focus on specific code elements and programming languages (e.g., methods in Java), limiting their applicability. This paper introduces CodeMapper, an approach to address the code mapping problem in a way that is independent of specific program elements and programming languages. Given a code region in one commit, CodeMapper finds the corresponding region in another commit. The approach consists of two phases: (i) computing candidate regions by analyzing diffs, detecting code movements, and searching for specific code fragments, and (ii) selecting the most likely target region by calculating similarities. Our evaluation applies CodeMapper to four datasets, including two new hand-annotated datasets containing code region pairs in ten popular programming languages. CodeMapper correctly identifies the expected target region in 71.0% to 94.5% of all cases, improving over the best available baselines by 1.5 to 58.8 absolute percentage points.
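To make the two-phase idea concrete, the following is a minimal sketch and not the authors' implementation. It works on whole lines only. Phase 1 is approximated with Python's standard `difflib` (diff-based candidates) plus an exact text search, which here stands in for movement detection. Phase 2 picks the candidate whose text, together with a few surrounding lines, has the highest normalized Levenshtein similarity to the source region. All function names, the `context` parameter, and the half-open `(start, end)` line ranges are assumptions made for illustration.

```python
import difflib


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def candidate_regions(old_lines, new_lines, start, end):
    """Phase 1 (simplified): line ranges in new_lines that may
    correspond to old_lines[start:end]."""
    candidates = set()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if not (i1 < end and start < i2):
            continue  # this diff block does not overlap the source region
        if tag == "equal":
            # Unchanged lines: shift the overlapping part to its new position.
            lo, hi = max(start, i1), min(end, i2)
            candidates.add((j1 + lo - i1, j1 + hi - i1))
        elif j2 > j1:
            # Changed lines: the replacement block is a candidate.
            candidates.add((j1, j2))
    # Text search: exact copies of the region elsewhere (e.g., moved code).
    region = old_lines[start:end]
    n = len(region)
    for j in range(len(new_lines) - n + 1):
        if new_lines[j:j + n] == region:
            candidates.add((j, j + n))
    return sorted(candidates)


def map_region(old_lines, new_lines, start, end, context=1):
    """Phase 2 (simplified): return the most similar candidate, or None."""
    def with_context(lines, s, e):
        return "\n".join(lines[max(0, s - context):e + context])

    source = with_context(old_lines, start, end)
    candidates = candidate_regions(old_lines, new_lines, start, end)
    if not candidates:
        return None  # the region appears to have been deleted
    return max(candidates,
               key=lambda c: similarity(source, with_context(new_lines, *c)))
```

For instance, if the old file contains the line `    if x > 0:` at index 1 and the new file has two lines prepended and the condition changed to `    if x >= 0:`, calling `map_region(old, new, 1, 2)` yields the range of the modified line in the new file. A real tool would also need to handle sub-line regions and obtain the file contents from the two commits, which this sketch leaves out.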
