From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Harsha Nori; Naoto Usuyama; Nicholas King; Scott Mayer McKinney; Xavier Fernandes; Sheng Zhang; Eric Horvitz

TL;DR

The paper examines run-time strategies for large language models on medical tasks, comparing Medprompt-enhanced GPT-4 with OpenAI's o1-preview to understand how reasoning-native inference shapes performance. It systematically benchmarks o1-preview across MedQA, MedMCQA, MMLU, NCLEX, and JMLE-2024, and analyzes prompting techniques, reasoning tokens, and ensembling. The findings show that o1-preview often surpasses Medprompt-augmented GPT-4, even without prompting techniques. Prompt engineering matters less for reasoning-native models: few-shot prompting can even hurt performance, whereas ensembling still improves accuracy, though at increased cost. The work identifies a cost-accuracy Pareto frontier across run-time strategies, underscores benchmark saturation, and outlines future directions in metareasoning, input optimization, external resource integration, and multi-agent runtimes for inference-time LLM computation in medicine.

Abstract

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
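
To make the notion of a cost-accuracy Pareto frontier concrete, the following is a minimal sketch, not the authors' analysis code. A configuration is on the frontier if no other configuration is both at least as cheap and at least as accurate, with a strict improvement in one of the two. The `Run` type, the `pareto_frontier` helper, and every name and number in `EXAMPLE_RUNS` are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One model + run-time strategy configuration (hypothetical structure)."""
    name: str
    cost: float      # cost per evaluation, in arbitrary units (assumption)
    accuracy: float  # fraction of questions answered correctly


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that are not dominated on (lower cost, higher accuracy)."""
    frontier: List[Run] = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; among equal costs, try the most accurate first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # A run is kept only if it beats every cheaper-or-equal run on accuracy.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Placeholder values for illustration only; NOT figures reported in the paper.
EXAMPLE_RUNS = [
    Run("cheap-baseline", cost=1.0, accuracy=0.80),
    Run("cheap-ensemble", cost=5.0, accuracy=0.85),
    Run("dominated-config", cost=6.0, accuracy=0.82),
    Run("expensive-reasoner", cost=20.0, accuracy=0.95),
]

if __name__ == "__main__":
    for run in pareto_frontier(EXAMPLE_RUNS):
        print(f"{run.name}: cost={run.cost}, accuracy={run.accuracy}")
```

In this toy data, `dominated-config` drops out because `cheap-ensemble` is both cheaper and more accurate. The remaining points trade cost for accuracy, which is the kind of trade-off the abstract describes between a more affordable model with steering and a costlier reasoning-native model.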
