Table of Contents
Fetching ...

Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks

Samy Ateia, Udo Kruschwitz

TL;DR

This study evaluates open-source LLMs against commercial models within BioASQ's domain-specific retrieval-augmented QA setting. By combining few-shot and prompt-based strategies with a retrieval pipeline (Elasticsearch on PubMed) and optional Wikipedia context, the authors demonstrate that Mixtral 8x7B can approach top commercial models in 10-shot configurations while exposing notable zero-shot gaps. The results highlight the potential and practicality of offline/open-source LLMs for confidential biomedical tasks, but also reveal the fragility of gains from fine-tuning and external knowledge augmentation. The work emphasizes the importance of few-shot design and cost/time efficiency, offering a path toward competitive, privacy-preserving biomedical QA systems and suggesting avenues for future research in few-shot optimization and knowledge-source selection.

Abstract

Commercial large language models (LLMs), like OpenAI's GPT-4 powering ChatGPT and Anthropic's Claude 3 Opus, have dominated natural language processing (NLP) benchmarks across different domains. New competing Open-Source alternatives like Mixtral 8x7B or Llama 3 have emerged and seem to be closing the gap while often offering higher throughput and being less costly to use. Open-Source LLMs can also be self-hosted, which makes them interesting for enterprise and clinical use cases where sensitive data should not be processed by third parties. We participated in the 12th BioASQ challenge, which is a retrieval augmented generation (RAG) setting, and explored the performance of current GPT models Claude 3 Opus, GPT-3.5-turbo and Mixtral 8x7b with in-context learning (zero-shot, few-shot) and QLoRa fine-tuning. We also explored how additional relevant knowledge from Wikipedia added to the context-window of the LLM might improve their performance. Mixtral 8x7b was competitive in the 10-shot setting, both with and without fine-tuning, but failed to produce usable results in the zero-shot setting. QLoRa fine-tuning and Wikipedia context did not lead to measurable performance gains. Our results indicate that the performance gap between commercial and open-source models in RAG setups exists mainly in the zero-shot setting and can be closed by simply collecting few-shot examples for domain-specific use cases. The code needed to rerun these experiments is available through GitHub.

Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks

TL;DR

This study evaluates open-source LLMs against commercial models within BioASQ's domain-specific retrieval-augmented QA setting. By combining few-shot and prompt-based strategies with a retrieval pipeline (Elasticsearch on PubMed) and optional Wikipedia context, the authors demonstrate that Mixtral 8x7B can approach top commercial models in 10-shot configurations while exposing notable zero-shot gaps. The results highlight the potential and practicality of offline/open-source LLMs for confidential biomedical tasks, but also reveal the fragility of gains from fine-tuning and external knowledge augmentation. The work emphasizes the importance of few-shot design and cost/time efficiency, offering a path toward competitive, privacy-preserving biomedical QA systems and suggesting avenues for future research in few-shot optimization and knowledge-source selection.

Abstract

Commercial large language models (LLMs), like OpenAI's GPT-4 powering ChatGPT and Anthropic's Claude 3 Opus, have dominated natural language processing (NLP) benchmarks across different domains. New competing Open-Source alternatives like Mixtral 8x7B or Llama 3 have emerged and seem to be closing the gap while often offering higher throughput and being less costly to use. Open-Source LLMs can also be self-hosted, which makes them interesting for enterprise and clinical use cases where sensitive data should not be processed by third parties. We participated in the 12th BioASQ challenge, which is a retrieval augmented generation (RAG) setting, and explored the performance of current GPT models Claude 3 Opus, GPT-3.5-turbo and Mixtral 8x7b with in-context learning (zero-shot, few-shot) and QLoRa fine-tuning. We also explored how additional relevant knowledge from Wikipedia added to the context-window of the LLM might improve their performance. Mixtral 8x7b was competitive in the 10-shot setting, both with and without fine-tuning, but failed to produce usable results in the zero-shot setting. QLoRa fine-tuning and Wikipedia context did not lead to measurable performance gains. Our results indicate that the performance gap between commercial and open-source models in RAG setups exists mainly in the zero-shot setting and can be closed by simply collecting few-shot examples for domain-specific use cases. The code needed to rerun these experiments is available through GitHub.
Paper Structure (22 sections, 13 tables)