AutoMedPrompt: A New Framework for Optimizing LLM Medical Prompts Using Textual Gradients

Sean Wu, Michael Koo, Fabien Scalzo, Ira Kurtz

TL;DR

AutoMedPrompt introduces a TextGrad-based framework for automatic medical system-prompt optimization, enabling open-source Llama 3 to surpass several proprietary models on MedQA, PubMedQA, and NephSAP. By treating the system prompt as the trainable parameter and backpropagating textual gradients from a natural-language loss, the approach tailors prompts to specific medical tasks without updating model weights. The method achieves state-of-the-art accuracy on PubMedQA (82.6%) and strong accuracy on MedQA (77.7%) and NephSAP (63.8%), outperforming previous prompting strategies and at times rivaling or exceeding proprietary models. The work emphasizes the advantages of task-specific prompt optimization for democratizing high-performing medical LLMs, with open-source code and data provided for reproducibility.
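
To make the idea of "training" a system prompt concrete, here is a minimal, self-contained sketch of the loop described above: a forward pass that answers questions, a critique of wrong answers that plays the role of a textual gradient, and a rewrite of the prompt that plays the role of the optimizer step. It does not use the TextGrad library or any LLM, and it is not the authors' implementation. All names (`PromptVariable`, `optimize_system_prompt`, `toy_answer`, `toy_critic`, `toy_rewrite`) and the toy question are hypothetical stand-ins chosen for illustration; in a real setup each stub would presumably be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PromptVariable:
    """Text treated as the trainable parameter (hypothetical helper)."""
    value: str
    gradient: str = ""  # textual feedback, filled in by the backward step


def optimize_system_prompt(
    prompt: PromptVariable,
    batch: List[Tuple[str, str]],
    answer_fn: Callable[[str, str], str],
    critic_fn: Callable[[str, str, str, str], str],
    rewrite_fn: Callable[[str, str], str],
    steps: int = 3,
) -> PromptVariable:
    """Repeat forward -> critique -> rewrite; no model weights are touched."""
    for _ in range(steps):
        feedback = []
        for question, gold in batch:
            prediction = answer_fn(prompt.value, question)  # forward pass
            if prediction != gold:  # loss signal: the answer was wrong
                feedback.append(
                    critic_fn(prompt.value, question, prediction, gold)
                )
        if not feedback:  # every answer in the batch is correct
            break
        prompt.gradient = " ".join(feedback)  # "backward": collect critiques
        prompt.value = rewrite_fn(prompt.value, prompt.gradient)  # "step"
    return prompt


# --- Toy stand-ins for the LLM calls (assumptions, not real models) ---

def toy_answer(system_prompt: str, question: str) -> str:
    # Pretend the model only commits to "yes" when told to weigh evidence.
    return "yes" if "evidence" in system_prompt else "maybe"


def toy_critic(system_prompt: str, question: str,
               prediction: str, gold: str) -> str:
    # Pretend critique of the prompt, expressed in natural language.
    return "Instruct the model to weigh the evidence before answering."


def toy_rewrite(system_prompt: str, gradient: str) -> str:
    # Pretend optimizer: append an instruction that addresses the critique.
    return system_prompt + " Weigh the evidence before answering."


toy_batch = [("Toy yes/no/maybe question", "yes")]  # illustrative data only
optimized = optimize_system_prompt(
    PromptVariable("You are a helpful medical assistant."),
    toy_batch, toy_answer, toy_critic, toy_rewrite,
)
print(optimized.value)
```

The design point the sketch tries to capture is that the only state changed across iterations is the prompt text itself, which is why this style of optimization needs no gradient access to the underlying model.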

Abstract

Large language models (LLMs) have demonstrated increasingly sophisticated performance in medical and other fields of knowledge. Traditional methods of creating specialist LLMs require extensive fine-tuning and training of models on large datasets. Recently, prompt engineering, instead of fine-tuning, has shown potential to boost the performance of general foundation models. However, prompting methods such as chain-of-thought (CoT) may not be suitable for all subspecialties, and k-shot approaches may introduce irrelevant tokens into the context space. We present AutoMedPrompt, which explores the use of textual gradients to elicit medically relevant reasoning through system prompt optimization. AutoMedPrompt leverages TextGrad's automatic differentiation via text to improve the ability of general foundation LLMs. We evaluated AutoMedPrompt on Llama 3, an open-source LLM, using several QA benchmarks, including MedQA, PubMedQA, and the nephrology subspecialty-specific NephSAP. Our results show that prompting with textual gradients outperforms previous methods on open-source LLMs and surpasses proprietary models such as GPT-4, Claude 3 Opus, and Med-PaLM 2. AutoMedPrompt sets a new state-of-the-art (SOTA) performance on PubMedQA with an accuracy of 82.6%, while also outperforming previous prompting strategies on open-source models for MedQA (77.7%) and NephSAP (63.8%).

Paper Structure

This paper contains 23 sections, 7 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Schematic of AutoMedPrompt, where textual gradients can be superior to traditional prompting strategies.