Analysing Environmental Efficiency in AI for X-Ray Diagnosis
Liam Kearns
TL;DR
The paper addresses the environmental and safety challenges of using AI for Covid-19 detection in chest X-rays by benchmarking 14 configurations that combine large language models (LLMs) and small discriminative classifiers within a Mendix application. It finds that a locally deployed discriminative model, Covid-Net, achieves the highest accuracy (95.5%) with a carbon footprint 99.9% smaller than that of GPT-4.5-Preview, while LLMs struggle with probabilistic calibration and exhibit bias toward positive diagnoses. Knowledge bases can improve LLM accuracy but produce mixed changes in energy consumption, so retrieval-augmented approaches are not uniformly beneficial. The study highlights the need for cautious deployment of generative AI in medical classification tasks and supports prioritising task-specific, energy-efficient discriminative models for sustainable, trustworthy AI-enabled diagnostics, with LLMs serving as supplementary tools when properly constrained.
Abstract
The integration of AI tools into medical applications has aimed to improve the efficiency of diagnosis, and the emergence of large language models (LLMs) such as ChatGPT and Claude has expanded this integration further. Because LLMs are versatile and easy to use through APIs, these larger models are often used even where smaller, custom models could be used instead. In this paper, LLMs and small discriminative models are integrated into a Mendix application to detect Covid-19 in chest X-rays. The discriminative models are also used to provide knowledge bases for the LLMs to improve their accuracy. The result is a benchmark study of 14 model configurations, compared on accuracy and environmental impact. The findings indicate that while smaller models reduced the carbon footprint of the application, the output was biased towards a positive diagnosis and the output probabilities lacked confidence. Meanwhile, restricting LLMs to probabilistic output alone led to poor performance in both accuracy and carbon footprint, demonstrating the risk of treating LLMs as a universal AI solution. Although the smaller LLM GPT-4.1-Nano reduced the carbon footprint by 94.2% compared to the larger models, its footprint remained disproportionately large relative to the discriminative models. The most efficient solution was the Covid-Net model: although its carbon footprint was larger than that of the other small models, it was 99.9% smaller than that of GPT-4.5-Preview, and the model achieved an accuracy of 95.5%, the highest of all models examined. This paper contributes a comparison of generative and discriminative models for Covid-19 detection and highlights the environmental risk of using generative tools for classification tasks.
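To put the percentage reductions quoted above in perspective, the following minimal sketch converts a percentage reduction into the implied ratio between the baseline footprint and the reduced one. It uses only the two percentages stated in the abstract (94.2% and 99.9%). The function names are hypothetical and are not taken from the study, and no absolute emission values are assumed.

```python
def percent_reduction(baseline: float, reduced: float) -> float:
    """Percentage by which `reduced` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - reduced) / baseline


def footprint_ratio(reduction_percent: float) -> float:
    """How many times larger the baseline footprint is than the reduced one.

    A reduction of r percent leaves (100 - r) percent of the baseline,
    so the baseline is 100 / (100 - r) times the reduced footprint.
    """
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction_percent must be in [0, 100)")
    return 100.0 / (100.0 - reduction_percent)


if __name__ == "__main__":
    # Percentages quoted in the abstract; the labels are descriptive only.
    for label, reduction in [
        ("GPT-4.1-Nano vs. larger models", 94.2),
        ("Covid-Net vs. GPT-4.5-Preview", 99.9),
    ]:
        ratio = footprint_ratio(reduction)
        print(f"{label}: {reduction}% reduction, baseline is about {ratio:.1f}x larger")
```

Read this way, a 94.2% reduction leaves a footprint roughly one seventeenth of the baseline, whereas a 99.9% reduction leaves roughly one thousandth of it. The gap between the two figures is therefore much larger than the percentages alone suggest.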
