Evaluating the effectiveness of LLM-based interoperability
Rodrigo Falcão, Stefan Schweitzer, Julien Siebert, Emily Calvet, Frank Elberzhager
TL;DR
This work tackles the interoperability challenge in dynamic systems of systems by evaluating two autonomous, LLM-based strategies (DIRECT and CODEGEN) for runtime data adaptation across four dataset versions in an agricultural use case. Thirteen open-source LLMs with up to 70B parameters are tested under zero-shot conditions on a data conversion task involving GeoJSON and John Deere data, using a GQM-based evaluation with pass@1 metrics and statistical tests. Results show that some models, notably qwen2.5-coder:32b, can achieve high effectiveness, especially with DIRECT on simpler datasets and CODEGEN on more complex ones, while the unit-conversion task (dataset v4) reveals reliability gaps; CODEGEN generally yields more deterministic, reusable solutions. The findings highlight the potential of autonomous interoperability via LLMs but also the need for broader domain evaluation and further work on reliability, scalability, and security before production deployment.
Abstract
Background: Systems of systems are becoming increasingly dynamic and heterogeneous, which adds pressure on the long-standing challenge of interoperability. Besides its technical aspect, interoperability also has an economic side, as development time and effort are required to build the interoperability artifacts. Objectives: Given the recent advances in large language models (LLMs), we aim to analyze the effectiveness of LLM-based strategies for making systems interoperate autonomously, at runtime, without human intervention. Method: We selected 13 open-source LLMs and curated four versions of a dataset for an agricultural interoperability use case. We performed three runs of each model with each version of the dataset, using two different strategies. We then compared the effectiveness of the models and the consistency of their results across runs. Results: qwen2.5-coder:32b was the most effective model with both strategies, DIRECT (average pass@1 >= 0.99) and CODEGEN (average pass@1 >= 0.89), in three out of four dataset versions. In the fourth dataset version, which included a unit conversion, all models using the DIRECT strategy failed, whereas qwen2.5-coder:32b using CODEGEN succeeded with an average pass@1 = 0.75. Conclusion: Some LLMs can make systems interoperate autonomously. Further evaluation in different domains is recommended, and further research on reliability strategies should be conducted.
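To make the reported "average pass@1" figures concrete, the following is a minimal sketch of how such a number could be computed. It assumes that pass@1 for one run is the fraction of test cases whose single generated answer was correct, and that the average is the plain mean over the runs; the exact aggregation used in the study is not stated in this section. The function names and the outcome data are hypothetical and serve only as illustration.

```python
from statistics import mean


def pass_at_1(outcomes):
    """Fraction of test cases solved with a single attempt each.

    `outcomes` is a list of booleans, one per test case (hypothetical input).
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def average_pass_at_1(runs):
    """Mean pass@1 over several independent runs (assumed aggregation)."""
    return mean(pass_at_1(run) for run in runs)


# Hypothetical outcomes for one model, one strategy, one dataset version:
# three runs, four test cases per run. These are NOT results from the study.
runs = [
    [True, True, True, False],   # pass@1 = 0.75
    [True, True, False, False],  # pass@1 = 0.50
    [True, True, True, True],    # pass@1 = 1.00
]

avg = average_pass_at_1(runs)
print(f"average pass@1 = {avg:.2f}")
```

Keeping the per-run values alongside the mean also exposes how much the runs differ from one another, which is a simple way to look at consistency across runs.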
