From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT

Jace Grandinetti, Rafe McBeth

TL;DR

The paper addresses domain-specific accuracy for large language models in medical physics without fine-tuning, using instead a retrieval-augmented Chain-of-Thought framework (ARCoT). ARCoT combines a domain-specific retrieval system, Step-Back prompting, and a Re-Ranking Transformer to surface relevant material and guide multi-step reasoning. On a RAPHEX Therapy exam benchmark, ARCoT substantially improved performance across leading LLMs, raising GPT-4 from $67\%$ to about $90\%$ and achieving an average improvement of $47\%$ over base models. The work demonstrates a practical, model-agnostic way to improve accuracy and reduce hallucinations in specialized domains.

Abstract

Large Language Models (LLMs) have achieved remarkable progress, yet their application in specialized fields, such as medical physics, remains challenging due to the need for domain-specific knowledge. This study introduces ARCoT (Adaptable Retrieval-based Chain of Thought), a framework designed to enhance the domain-specific accuracy of LLMs without requiring fine-tuning or extensive retraining. ARCoT integrates a retrieval mechanism to access relevant domain-specific information and employs step-back and chain-of-thought prompting techniques to guide the LLM's reasoning process, ensuring more accurate and context-aware responses. Benchmarking on a medical physics multiple-choice exam, our model outperformed standard LLMs and reported average human performance, demonstrating improvements of up to 68% and achieving a high score of 90%. This method reduces hallucinations and increases domain-specific performance. The versatility and model-agnostic nature of ARCoT make it easily adaptable to various domains, showcasing its significant potential for enhancing the accuracy and reliability of LLMs in specialized fields.

Paper Structure

This paper contains 10 sections, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Architecture of the proposed ARCoT framework with an example of a user query. A hybrid Step-Back (SB) prompting approach is implemented after the input query to improve similarity results with the original embedded prompt. A re-ranking transformer filters for the results with the highest relevance, and a Chain-of-Thought (CoT) prompt is used to further enhance model inference.
  • Figure 2: Radar plots depicting the benchmark scores of each LLM using the ARCoT framework (solid) against the base model (dashed and filled). Edges of each plot correspond to a score of 100%.
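
The flow described in the Figure 1 caption (step-back prompting, similarity retrieval, re-ranking, then a CoT prompt) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the token-overlap `rerank` stands in for a re-ranking transformer, and `llm` is a hypothetical callable that returns the model's text. All function names and prompt wordings are invented for this sketch.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Assumption: toy stand-in for an embedding model (term counts)."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def step_back(query: str, llm: Callable[[str], str]) -> str:
    """Ask the model for the broader principle behind the question."""
    return llm("What general principle underlies this question?\n" + query)


def retrieve(queries: List[str], corpus: List[str], k: int) -> List[str]:
    """Score each passage by its best similarity to any of the queries
    (original plus step-back) and keep the top k non-zero matches."""
    scored = [(max(cosine(embed(q), embed(p)) for q in queries), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def rerank(query: str, passages: List[str], top_n: int) -> List[str]:
    """Assumption: stand-in for a re-ranking transformer. Orders passages
    by the fraction of original-query tokens they contain."""
    query_tokens = set(tokenize(query))

    def overlap(passage: str) -> float:
        if not query_tokens:
            return 0.0
        return len(query_tokens & set(tokenize(passage))) / len(query_tokens)

    return sorted(passages, key=overlap, reverse=True)[:top_n]


def build_cot_prompt(query: str, context: List[str]) -> str:
    """Assemble the retrieved context and a chain-of-thought cue."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nLet's think step by step."


def arcot_prompt(query: str, corpus: List[str], llm: Callable[[str], str],
                 k: int = 2, top_n: int = 1) -> str:
    """Chain the stages: step-back, retrieval, re-ranking, CoT prompt."""
    broader = step_back(query, llm)
    candidates = retrieve([query, broader], corpus, k)
    return build_cot_prompt(query, rerank(query, candidates, top_n))


if __name__ == "__main__":
    demo_corpus = [
        "The half-value layer is the thickness that reduces beam intensity by half.",
        "Photon attenuation in matter follows an exponential law.",
    ]
    # Hypothetical stub in place of a real model call.
    stub_llm = lambda prompt: "Photon attenuation in matter"
    print(arcot_prompt("What is the half-value layer of a photon beam?",
                       demo_corpus, stub_llm))
```

Retrieving against both the original and the step-back query widens the candidate set, while re-ranking against the original query alone narrows it back to the passages most specific to the question. In practice the stand-ins would be replaced by a real embedding model, a vector index, and a cross-encoder re-ranker.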