Can Large Language Models Improve SE Active Learning via Warm-Starts?

Lohith Senthilkumar, Tim Menzies

TL;DR

This paper tackles data scarcity in software engineering optimization by evaluating whether large language models can generate effective warm-starts for active learning. It introduces a MOOT-based empirical study across 49 multi-objective SE tasks, comparing LLM-generated warm starts against Gaussian Process Models and Tree of Parzen Estimators, with a rigorous evaluation using Chebyshev distance and Scott-Knott statistics. Results show that LLMs substantially reduce labeling effort and improve outcomes for low- and medium-dimensional problems, but their advantage diminishes in high-dimensional settings where Bayesian methods excel. The work demonstrates the potential of LLMs to complement traditional Bayesian approaches in SE optimization and provides an open, reproducible benchmark for further research into warm-start strategies and dimensionality-aware learning.

Abstract

When SE data is scarce, "active learners" use models learned from tiny samples of the data to find the next most informative example to label. In this way, effective models can be generated using very little data. For multi-objective software engineering (SE) tasks, active learning can benefit from an effective set of initial guesses (also known as "warm starts"). This paper explores the use of Large Language Models (LLMs) for creating warm-starts. Those results are compared against Gaussian Process Models and Tree of Parzen Estimators. For 49 SE tasks, LLM-generated warm starts significantly improved the performance of low- and medium-dimensional tasks. However, LLM effectiveness diminishes in high-dimensional problems, where Bayesian methods like Gaussian Process Models perform best.
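The warm-started active learning loop described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the function names (`active_learn`, `distance`), the "closer to best than to rest" acquisition rule, and the toy labeling function are hypothetical and are not the paper's TPE, GPM, or LLM components. In the paper's setting, the `warm_start` examples would come from an LLM rather than from the head of the pool.

```python
import random

def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def active_learn(pool, label, warm_start, budget):
    """Label at most `budget` examples, starting from a non-empty `warm_start`.

    `label(x)` returns a score where lower is better. The acquisition rule
    is a simple stand-in (an assumption), not TPE or a Gaussian Process.
    """
    labeled = [(x, label(x)) for x in warm_start]
    unlabeled = [x for x in pool if x not in warm_start]
    while len(labeled) < budget and unlabeled:
        # Split what has been labeled so far into "best" and "rest".
        labeled.sort(key=lambda pair: pair[1])
        n_best = max(1, int(len(labeled) ** 0.5))
        best = [x for x, _ in labeled[:n_best]]
        rest = [x for x, _ in labeled[n_best:]]

        def acquire(x):
            # Prefer candidates near the best examples and far from the rest.
            to_best = min(distance(x, b) for b in best)
            to_rest = min((distance(x, r) for r in rest), default=0.0)
            return to_best - to_rest

        pick = min(unlabeled, key=acquire)
        unlabeled.remove(pick)
        labeled.append((pick, label(pick)))
    return min(labeled, key=lambda pair: pair[1]), len(labeled)

def toy_label(x):
    """Hypothetical expensive oracle: distance to a hidden optimum."""
    return abs(x[0] - 0.7) + abs(x[1] - 0.2)

rng = random.Random(1)
pool = [(rng.random(), rng.random()) for _ in range(200)]
warm = pool[:4]  # stand-in for LLM-proposed initial guesses
(best_x, best_y), n_labels = active_learn(pool, toy_label, warm, budget=20)
```

Because the returned best is taken over everything labeled, including the warm start, the result can never be worse than the best initial guess. That is one reason the quality of the initial guesses matters when the labeling budget is small.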

Paper Structure (44 sections, 8 equations, 5 figures, 12 tables)

Figures (5)

  • Figure 1: Our TPE active learner explores the decision space of a two-class classifier.
  • Figure 2: Warm-starting with LLM-synthesized examples in cycle 0 of active learning.
  • Figure 3: Average optimizations seen across all data sets. Most improvement occurred after a few dozen samples. Let b4.mu and b4.lo be the mean and smallest Chebyshev distances seen in the original data. Let now.mu and now.sd be the mean and standard deviation of the best Chebyshev distances seen in 20 repeats of our active learning experiments (in this case, LLM warm starts followed by exploit). The blue plot shows (now.mu - b4.lo) / (b4.mu - b4.lo); i.e., the improvement seen by optimization, normalized by the maximum possible improvement (for the blue line in this plot, lower values are better).
  • Figure 4: Median results from the Villalobos et al. model (shown in green) estimate that by 2028, we will run out of new textual data needed to train bigger and better LLMs.
  • Figure 5: A sample of runtimes from these experiments.
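
The normalized score in the Figure 3 caption can be worked through with a small example. The sketch below assumes one common form of Chebyshev distance (the largest normalized gap between a row's objectives and an ideal point); the paper's exact definition is not shown in this excerpt, so `chebyshev`, its arguments, and the sample numbers are illustrative assumptions. Only the formula in `normalized_improvement` is taken from the caption.

```python
from statistics import mean

def chebyshev(row, lo, hi, ideal):
    """Largest normalized gap to the ideal point (assumed definition).

    `lo` and `hi` are per-objective minima and maxima; `ideal` holds 0 for
    objectives to minimize and 1 for objectives to maximize.
    """
    return max(abs(i - (v - l) / (h - l))
               for v, l, h, i in zip(row, lo, hi, ideal))

def normalized_improvement(now_mu, b4_mu, b4_lo):
    """(now.mu - b4.lo) / (b4.mu - b4.lo), as in the Figure 3 caption."""
    return (now_mu - b4_lo) / (b4_mu - b4_lo)

# Made-up numbers, for illustration only.
b4 = [0.2, 0.5, 0.8]          # Chebyshev distances in the original data
now = [0.25, 0.35]            # best distance found in each repeat
score = normalized_improvement(mean(now), mean(b4), min(b4))
```

With these numbers, b4.mu is 0.5, b4.lo is 0.2, and now.mu is 0.3, so the score is about 0.33. A score of 0 would mean the optimizer matched the best row in the original data, and a score of 1 would mean it did no better than the average row.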