Hyperparameter Optimization for AST Differencing

Matias Martinez, Jean-Rémy Falleri, Martin Monperrus

TL;DR

The paper examines how hyperparameters in AST differencing algorithms can degrade or improve diff quality, and introduces Diff Auto Tuning (DAT) to optimize GumTree configurations in a data-driven manner. DAT uses Grid Search, Hyperopt, and Optuna to minimize edit-script length across a training set. Global optimization learns default configurations per language and meta-model, while local optimization tailors the configuration to an individual file pair. Empirically, global optimization improves the edit-scripts in 21.8% (JDT) and 16.1% (Spoon) of cases, local optimization yields improvements of up to 27.4%, and Hyperopt and Optuna search faster than exhaustive grid search. The work provides a public tool and protocol, showing that hyperparameter tuning is broadly applicable to AST differencing and can produce shorter, more understandable edit-scripts in practice.
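To make the global/local distinction concrete, here is a minimal grid-search sketch. It is not DAT's implementation: the hyperparameter names (`min_height`, `sim_threshold`), their value ranges, the toy `edit_script_length` function, and the choice of summed edit-script length as the global objective are all assumptions made for illustration. A real setup would run the AST differ on each file pair instead of the toy function.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only
# and are not taken from GumTree or from the paper.
SEARCH_SPACE = {
    "min_height": [1, 2, 3],
    "sim_threshold": [0.2, 0.5, 0.8],
}


def edit_script_length(file_pair, config):
    """Toy stand-in for running an AST differ and counting edit actions.

    Each fake file pair is (ideal_height, ideal_sim, base_length); the
    further a configuration is from the pair's ideal values, the longer
    the simulated edit-script becomes.
    """
    ideal_height, ideal_sim, base_length = file_pair
    penalty = abs(config["min_height"] - ideal_height)
    penalty += round(10 * abs(config["sim_threshold"] - ideal_sim))
    return base_length + penalty


def grid(space):
    """Yield every configuration in the Cartesian product of the space."""
    keys = sorted(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def global_search(file_pairs, space):
    """One configuration for the whole training set (assumed objective:
    minimal total edit-script length)."""
    return min(
        grid(space),
        key=lambda c: sum(edit_script_length(p, c) for p in file_pairs),
    )


def local_search(file_pair, space):
    """The configuration that gives the shortest script for one pair."""
    return min(grid(space), key=lambda c: edit_script_length(file_pair, c))


training_set = [(1, 0.2, 10), (2, 0.5, 7), (2, 0.5, 4)]
best_global = global_search(training_set, SEARCH_SPACE)
best_local = [local_search(p, SEARCH_SPACE) for p in training_set]

global_total = sum(edit_script_length(p, best_global) for p in training_set)
local_total = sum(
    edit_script_length(p, c) for p, c in zip(training_set, best_local)
)
print(best_global, global_total, local_total)
```

In this toy data the global search settles on the configuration that suits the majority of pairs, while the local search picks a different configuration for the first pair. The per-pair total can never be longer than the global one, at the cost of one search per file pair.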

Abstract

Computing the differences between two versions of the same program is an essential task for software development and software evolution research. AST differencing is the most advanced way of doing so, and an active research area. Yet, AST differencing algorithms rely on configuration parameters that may have a strong impact on their effectiveness. In this paper, we present a novel approach named DAT (Diff Auto Tuning) for hyperparameter optimization of AST differencing. We thoroughly state the problem of hyper-configuration for AST differencing. We evaluate our data-driven approach DAT to optimize the edit-scripts generated by the state-of-the-art AST differencing algorithm named GumTree in different scenarios. DAT is able to find a new configuration for GumTree that improves the edit-scripts in 21.8% of the evaluated cases.

Paper Structure

This paper contains 51 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (Case 1) Example of spurious add-remove edits found by GumTree using the default configuration and the JDT meta-model. Non-spurious edits are not shown in the figure.
  • Figure 2: (Case 2) Visualization of the edits computed by GumTree using default configuration.
  • Figure 3: Workflow of DAT. It searches for the best AST differencing configuration in a data-driven manner, based on a set of file pairs.
  • Figure 4: Distribution of the percentage reduction in edit-script length obtained with the optimized configuration relative to the default configuration. (Cases with no improvement or detriment are ignored.)
  • Figure 5: Distribution of AST sizes (# of nodes).
  • ...and 1 more figure