Pareto-Optimized Open-Source LLMs for Healthcare via Context Retrieval

Jordi Bayarri-Planas, Ashwin Kumar Gururajan, Dario Garcia-Gasulla

TL;DR

The paper tackles the high cost and limited accessibility of proprietary LLMs for healthcare AI by pairing open-source models with optimized context retrieval (CR). It introduces a reproducible CR pipeline, evaluates self-consistency chain-of-thought (SC-CoT) prompting, and releases open resources for training and deploying cost-effective healthcare AI. It also proposes OpenMedQA, a benchmark for open-ended medical QA, which reveals a performance gap relative to multiple-choice QA (MCQA). DeepSeek-R1 thinking data and enhanced retrieval help smaller models narrow that gap and approach proprietary performance at lower cost. The results shift the cost-accuracy Pareto frontier on MedQA, with practical implications for scalable, affordable, and reliable healthcare AI in resource-constrained settings.

Abstract

This study leverages optimized context retrieval to enhance open-source Large Language Models (LLMs) for cost-effective, high-performance healthcare AI. We demonstrate that this approach achieves state-of-the-art accuracy on medical question answering at a fraction of the cost of proprietary models, significantly improving the cost-accuracy Pareto frontier on the MedQA benchmark. Key contributions include: (1) OpenMedQA, a novel benchmark revealing a performance gap in open-ended medical QA compared to multiple-choice formats; (2) a practical, reproducible pipeline for context retrieval optimization; and (3) open-source resources (Prompt Engine, CoT/ToT/Thinking databases) to empower healthcare AI development. By advancing retrieval techniques and QA evaluation, we enable more affordable and reliable LLM solutions for healthcare.
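
To make the notion of a cost-accuracy Pareto frontier concrete, the following is a minimal sketch of how such a frontier can be computed from a list of (cost, accuracy) points. It is an illustration only, not the authors' evaluation code: the function name `pareto_frontier`, the model names, and all numbers are hypothetical placeholders rather than results from the paper.

```python
from typing import List, Tuple

# (name, cost, accuracy); lower cost and higher accuracy are both better.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[str]:
    """Return the names of non-dominated points, ordered by increasing cost.

    A point is dominated if another point is at least as cheap and at least
    as accurate, and strictly better on one of the two axes.
    """
    frontier: List[str] = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; on equal cost, put the more accurate point first.
    for name, _cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        # A point joins the frontier only if it beats every cheaper point.
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# Hypothetical data, invented for illustration (not taken from the paper).
models: List[Point] = [
    ("model_a", 0.5, 70.0),
    ("model_b", 1.0, 68.0),   # dominated by model_a (costlier, less accurate)
    ("model_c", 2.0, 80.0),
    ("model_d", 10.0, 85.0),
    ("model_e", 12.0, 84.0),  # dominated by model_d
]

frontier_names = pareto_frontier(models)
```

Under this definition, a new configuration shifts the frontier when it is cheaper than an existing frontier point at equal or better accuracy, or more accurate at equal or lower cost.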

Paper Structure (11 sections, 3 figures, 5 tables)

This paper contains 11 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Enhanced Pareto Frontier of Accuracy vs. Cost on MedQA. The solid line represents the improved efficiency frontier, demonstrably surpassing the original Pareto frontier (dashed line). Circular markers indicate open-source models, while triangles represent closed models. The green shaded area visually highlights the region of significant cost-effective accuracy gains.
  • Figure 2: Components of the question-answering system based on context retrieval for LLMs.
  • Figure 3: Accuracy vs. $\mathrm{CO}_2$ emissions for increasing ensemble sizes in the SC-CoT setting. The solid lines represent accuracy trends for each dataset, while the dashed black line indicates the average accuracy. The shaded bars show power consumption in kWh, highlighting the trade-off between performance gains and environmental cost.
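
The ensemble size in Figure 3 refers to how many reasoning paths are sampled per question before the answers are aggregated. As a rough sketch of that aggregation step, the snippet below applies a plain majority vote over multiple-choice answers, which is the usual form of self-consistency. The summary above does not specify the paper's exact aggregation rule, so the function name `self_consistency_vote`, the answer normalization, and the tie-breaking behaviour are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def self_consistency_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the sampled reasoning paths.

    Answers are normalized (whitespace stripped, upper-cased) before counting.
    Ties are broken in favour of the answer that appeared first, which is an
    arbitrary choice made for this sketch.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(answer.strip().upper() for answer in answers)
    # most_common keeps first-seen order among equally frequent answers.
    return counts.most_common(1)[0][0]


# Hypothetical ensemble of five sampled answers to one multiple-choice question.
sampled_answers = ["A", "b", "B ", "C", "B"]
final_answer = self_consistency_vote(sampled_answers)
```

Each additional sample costs one more model generation, which is why accuracy gains from larger ensembles have to be weighed against the energy use plotted in the figure.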