Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction

Saurabh Srivastava, Ziyu Yao

TL;DR

The paper presents the first systematic study of prompt optimization for Large Reasoning Models (LRMs), using end-to-end event extraction as a case study. By evaluating LRMs and general-purpose LLMs as both task models and prompt optimizers within a Monte Carlo Tree Search framework, the authors show that LRMs used as task models still gain substantially from prompt optimization, and that LRMs used as optimizers often produce more effective prompts than LLMs. The results generalize beyond event extraction to tasks such as Geometric Shapes and NCBI Disease NER, where LRMs similarly excel as optimizers. An error analysis shows that LRM-optimized prompts reduce common extraction errors and that LRMs converge faster and more stably during optimization, highlighting their potential as both consumers and producers of high-quality prompts across diverse tasks.

Abstract

Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction as a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Our findings also generalize to tasks beyond event extraction. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.

Paper Structure

This paper contains 40 sections, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Summary of our main results, where LRMs and LLMs are used as either the task model ($\mathcal{M}_{task}$) or the optimizer ($\mathcal{M}_{opt}$) in prompt optimization, and we observed a strong advantage of LRMs over LLMs.
  • Figure 2: An example prompt for end-to-end Event Extraction (EE) used in our experiments, consisting of a task instruction and an event schema. The event schema contains information about the labels that are represented as Python classes and event guidelines defining both the event classes and the arguments. In prompt optimization, we refine both the task instruction and event guidelines (shown for two events; others omitted due to space limits) to generate more effective prompts for the task model.
  • Figure 3: Overview of our prompt optimization framework. At each iteration, a zero-shot task LLM generates outputs, while a separate optimizer LLM analyzes the errors and updates the prompt, including task instructions and event guidelines, accordingly. This process continues over batches of training samples $\mathcal{D}_{train}$, and the final optimized prompt is evaluated on the development set to determine the node reward $r_t$.
  • Figure 4: Convergence analysis of prompt optimization across different task models with two optimizers: DeepSeek-R1 (left) and GPT-4.5 (right). Task models converge faster with minimal variance when their prompts are optimized by LRMs.
  • Figure 5: (a) A survival plot showing the % of prompts (y-axis) that achieve at least a given AC score (x-axis) for DeepSeek-R1 across different optimizers. (b) Prompt length vs. AC score across the best-performing full MCTS configuration for each task model on dev set. (c) Error categorization for DeepSeek-R1 as the task model with various optimizers.
  • ...and 2 more figures
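
The optimization loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `optimize_prompt`, `toy_task_model`, `toy_optimizer`, `score`, and the toy data are all hypothetical names introduced here, the two models are plain Python functions rather than real LLM calls, exact-match accuracy stands in for the paper's extraction metrics, and the Monte Carlo Tree Search over candidate prompts is omitted, so only a single optimization path ending in one dev-set reward is shown.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (input text, gold event label)
ErrorCase = Tuple[str, str, str]     # (input text, gold label, predicted label)


def score(preds: List[str], golds: List[str]) -> float:
    """Exact-match accuracy (assumed stand-in for the real task metric)."""
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def optimize_prompt(
    prompt: str,
    task_model: Callable[[str, str], str],
    optimizer_model: Callable[[str, List[ErrorCase]], str],
    train: List[Example],
    dev: List[Example],
    batch_size: int = 2,
) -> Tuple[str, float]:
    """Refine the prompt over training batches, then score it on the dev set."""
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]
        errors: List[ErrorCase] = []
        for text, gold in batch:
            pred = task_model(prompt, text)        # zero-shot task model
            if pred != gold:
                errors.append((text, gold, pred))
        if errors:                                 # optimizer sees only the errors
            prompt = optimizer_model(prompt, errors)
    dev_preds = [task_model(prompt, text) for text, _ in dev]
    reward = score(dev_preds, [gold for _, gold in dev])
    return prompt, reward


# --- Toy stand-ins (hypothetical; real systems would call LLMs here) ---
KEYWORDS = {"attacked": "Attack", "married": "Marry"}


def toy_task_model(prompt: str, text: str) -> str:
    """Predicts a label only if the prompt already mentions that label."""
    for word, label in KEYWORDS.items():
        if word in text and label in prompt:
            return label
    return "None"


def toy_optimizer(prompt: str, errors: List[ErrorCase]) -> str:
    """Appends a guideline line for every gold label the prompt lacks."""
    missing = sorted({gold for _, gold, _ in errors if gold not in prompt})
    return prompt + "".join(f"\n- {label}: guideline added" for label in missing)


train_set = [
    ("Rebels attacked the town.", "Attack"),
    ("They married in June.", "Marry"),
    ("Troops attacked at dawn.", "Attack"),
    ("The pair married abroad.", "Marry"),
]
dev_set = [("The base was attacked.", "Attack"), ("She married him.", "Marry")]

final_prompt, reward = optimize_prompt(
    "Extract the event type.", toy_task_model, toy_optimizer, train_set, dev_set
)
print(final_prompt)
print(reward)
```

In the framework the caption describes, the dev-set score returned at the end would serve as the node reward $r_t$ for one node of the search; the sketch computes that reward for a single path only.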