COSMosFL: Ensemble of Small Language Models for Fault Localisation

Hyunjoon Cho, Sungmin Kang, Gabin An, Shin Yoo

TL;DR

This paper addresses the cost and security concerns of large closed-source LLMs by proposing COSMosFL, an ensemble of open-source small language models (SLMs) for fault localisation (FL). It builds on AutoFL, replacing the proprietary model with four open SLMs and comparing two voting schemes (equal weighting vs. DE-optimised weighting) to balance FL accuracy against energy consumption, inference time, and token usage. Experiments on the Defects4J benchmark show that the ensemble can achieve Pareto-optimal trade-offs and can exploit orthogonality between models to improve localisation performance under cost constraints. The work contributes a task-level ensemble framework for FL, demonstrates a practical open-source deployment, and discusses future enhancements such as routing-based approaches and robust optimisation for explanations.
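The difference between the two voting schemes can be illustrated with a minimal sketch. It assumes that each model contributes one predicted faulty method per run and that a model's weight is split evenly across its runs; the function name `weighted_vote`, the model names, and the method names are hypothetical, and the exact aggregation formula used by COSMosFL is not given in this summary.

```python
from collections import defaultdict


def weighted_vote(predictions_by_model, weights):
    """Rank candidate methods by weighted votes (illustrative assumption).

    predictions_by_model: {model name: [predicted method, one per run]}
    weights: {model name: non-negative weight}
    """
    scores = defaultdict(float)
    for model, predictions in predictions_by_model.items():
        if not predictions:
            continue
        # Split the model's weight evenly across its runs.
        share = weights[model] / len(predictions)
        for method in predictions:
            scores[method] += share
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical predictions from two models, three runs each.
predictions = {
    "model_a": ["Foo.bar", "Foo.bar", "Foo.baz"],
    "model_b": ["Foo.baz", "Foo.baz", "Foo.qux"],
}

equal = weighted_vote(predictions, {"model_a": 0.5, "model_b": 0.5})
skewed = weighted_vote(predictions, {"model_a": 0.9, "model_b": 0.1})

print([method for method, _ in equal])
print([method for method, _ in skewed])
```

With equal weights, `Foo.baz` ranks first because both models vote for it; shifting weight towards `model_a` promotes `Foo.bar` instead. An optimiser-based scheme would search for the weight vector that maximises accuracy on held-out bugs rather than fixing it by hand.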

Abstract

LLMs are rapidly being adopted to build powerful tools and agents for software engineering, but most of them rely heavily on extremely large closed-source models. This, in turn, can hinder wider adoption due to security issues as well as financial cost and environmental impact. Recently, a number of open-source Small Language Models (SLMs) have been released and are gaining traction. While SLMs are smaller, more energy-efficient, and therefore easier to deploy locally, they tend to perform worse than larger closed LLMs. We present COSMos, a task-level LLM ensemble technique that uses a voting mechanism to provide a broader range of choices between SLMs and LLMs. We instantiate COSMos with an LLM-based Fault Localisation technique, AutoFL, and report the cost-benefit trade-off between LLM accuracy and various costs such as energy consumption, inference time, and the number of tokens used. An empirical evaluation using Defects4J shows that COSMos can build effective ensembles that achieve Pareto-optimality in terms of FL accuracy and inference cost when compared to individual models.
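To make the notion of Pareto-optimality concrete, the sketch below filters a set of configurations down to those not dominated in both cost (lower is better) and accuracy (higher is better). The function name `pareto_front` and all numbers are made up for illustration; they are not results from the paper.

```python
def pareto_front(configs):
    """Return names of configurations that no other configuration dominates.

    configs: list of (name, cost, accuracy); cost is minimised,
    accuracy is maximised.
    """
    front = []
    for name, cost, acc in configs:
        dominated = any(
            # Another config is at least as good on both axes...
            other_cost <= cost and other_acc >= acc
            # ...and strictly better on at least one.
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in configs
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative values only: (name, inference cost, bugs localised).
configs = [
    ("model_a", 10.0, 50),
    ("model_b", 12.0, 45),
    ("ensemble", 20.0, 60),
    ("model_c", 25.0, 55),
]

front = pareto_front(configs)
print(front)
```

In this toy data, `model_b` is dominated by `model_a` (more expensive and less accurate) and `model_c` by `ensemble`, so only `model_a` and `ensemble` remain on the front. The same check applies to any single cost axis, such as energy, time, or token count.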

Paper Structure

This paper contains 19 sections, 10 figures, 1 table, and 1 algorithm.

Figures (10)

  • Figure 1: Overview of our approach against AutoFL [kangQuantitativeQualitativeEvaluation2024a], with differences colored in red.
  • Figure 2: Overlap of bugs ranked first by Llama3, Llama3.1, Mistral NeMo, and Qwen2.5-Coder. Each model is run 5 times without applying the ensemble.
  • Figure 3: $acc@k$ at $R=20$ for each model and ensemble approach, alongside AutoFL’s reported GPT-3.5 performance.
  • Figure 4: Overlap of top-ranked bugs at each $k$ for a single sample ($R=20$) of DE and Equal Weight Ensembles.
  • Figure 5: Mean of $acc@1$ across runs for four single models and two ensemble approaches. Note that the ensemble techniques are only available at multiples of four runs.
  • ...and 5 more figures
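The $acc@k$ metric in the captions above can be sketched as follows, assuming the definition commonly used in fault localisation work: the number of bugs for which at least one truly faulty method appears among the top $k$ ranked candidates. The function name `acc_at_k` and the bug and method identifiers are hypothetical.

```python
def acc_at_k(rankings, faulty, k):
    """Count bugs with at least one faulty method in the top-k ranking.

    rankings: {bug id: [method names, best first]}
    faulty:   {bug id: set of truly faulty method names}
    """
    hits = 0
    for bug_id, ranked in rankings.items():
        if any(method in faulty[bug_id] for method in ranked[:k]):
            hits += 1
    return hits


# Hypothetical rankings for three bugs.
rankings = {
    "bug_1": ["A.f", "A.g"],
    "bug_2": ["B.h", "B.i", "B.j"],
    "bug_3": ["C.k"],
}
faulty = {
    "bug_1": {"A.f"},
    "bug_2": {"B.j"},
    "bug_3": {"C.z"},
}

for k in (1, 2, 3):
    print(k, acc_at_k(rankings, faulty, k))
```

Here `bug_1` is localised at $k=1$, `bug_2` only at $k=3$, and `bug_3` never, so the count increases from 1 to 2 as $k$ grows.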