From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz

TL;DR

The paper examines run-time strategies for large language models in medical tasks, comparing Medprompt-enhanced GPT-4 with OpenAI's o1-preview to understand how reasoning-native inference shapes performance. It systematically benchmarks o1-preview across MedQA, MedMCQA, MMLU, NCLEX, and JMLE-2024, and analyzes prompting techniques, reasoning tokens, and ensembling. The findings show that o1-preview often surpasses Medprompt-augmented GPT-4, even without prompting techniques. Prompt engineering matters less for reasoning-native models, and few-shot prompting can even hurt performance, whereas ensembling still improves accuracy at increased cost. The work highlights a new cost-accuracy Pareto frontier, underscores benchmark saturation, and outlines future directions in metareasoning, input optimization, external resource integration, and multi-agent runtimes for inference-time LLM computation in medicine.

Abstract

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
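
As an illustration of the cost-accuracy analysis mentioned in the abstract, the following is a minimal sketch of how a Pareto frontier can be extracted from a set of (cost, accuracy) measurements. It is not the authors' code: the `Run` type, the `pareto_frontier` function, and the example values are hypothetical placeholders rather than results from the paper.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One model/strategy configuration (hypothetical container)."""
    name: str
    cost: float      # total API cost, in arbitrary units
    accuracy: float  # fraction of questions answered correctly


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that no cheaper-or-equal-cost run matches or beats in accuracy."""
    frontier: List[Run] = []
    best_accuracy = float("-inf")
    # Visit runs from cheapest to most expensive; on cost ties, most accurate first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Placeholder values for illustration only; these are NOT numbers from the paper.
example_runs = [
    Run("A", 1.0, 0.80),
    Run("B", 2.0, 0.78),   # dominated by A (costs more, less accurate)
    Run("C", 5.0, 0.90),
    Run("D", 5.0, 0.85),   # dominated by C (same cost, less accurate)
    Run("E", 20.0, 0.95),
]

frontier_names = [run.name for run in pareto_frontier(example_runs)]
```

A configuration stays on the frontier only if improving its accuracy requires paying more, which is the sense in which a cheaper model with steering and a costlier reasoning model can both be reasonable choices.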

Paper Structure

This paper contains 33 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Pareto frontier showing accuracy versus total API cost (log scale) on the MedQA benchmark (1273 questions total). We compare o1-preview (Sep 2024), GPT-4o (Aug 2024), and GPT-4 Turbo (Nov 2023) with various run-time steering strategies.
  • Figure 2: (a) Comparative analyses of performance of multiple models on MedQA. (b) Comparisons on a wide range of medical challenge benchmarks.
  • Figure 3: Visual illustration of Medprompt components and additive contributions to performance on MedQA. The prompting strategy combines $k$NN-based few-shot example selection, GPT-4-generated chain-of-thought prompting, and answer-choice shuffled ensembling. Relative contributions of each component are shown at the bottom. Figure from Nori et al. (2023).
  • Figure 4: JMLE-2024: National medical licensing exam held in Japan in February 2024
  • Figure 5: Comparison of prompting techniques on MedQA with the o1-preview model. Error bars indicate one standard deviation from three independent samples.
  • ...and 8 more figures
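
The answer-choice shuffled ensembling named in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Medprompt implementation: `choice_shuffle_ensemble` is a hypothetical name, and `ask_model` is a hypothetical callable standing in for an LLM call that returns the index of the option it selects from the shuffled list.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def choice_shuffle_ensemble(
    question: str,
    choices: Sequence[str],
    ask_model: Callable[[str, Sequence[str]], int],  # hypothetical LLM wrapper
    n_runs: int = 5,
    seed: int = 0,
) -> int:
    """Query the model n_runs times with shuffled options and majority-vote.

    Returns the index (into the ORIGINAL `choices`) of the most-voted answer.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_runs):
        # order[j] is the original index of the option shown at position j.
        order = list(range(len(choices)))
        rng.shuffle(order)
        shuffled = [choices[i] for i in order]
        picked_position = ask_model(question, shuffled)
        # Map the model's pick back to the original option before voting.
        votes[order[picked_position]] += 1
    return votes.most_common(1)[0][0]


# Stub "model" for illustration: always selects the option with text "beta".
def stub_model(question: str, shuffled: Sequence[str]) -> int:
    return list(shuffled).index("beta")


winner = choice_shuffle_ensemble(
    "placeholder question", ["alpha", "beta", "gamma", "delta"], stub_model, n_runs=7
)
```

Shuffling the option order between runs is intended to reduce sensitivity to answer position, and each run is a separate model call, which is why ensembling raises cost along with accuracy.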