A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation

Bhaskarjit Sarmah, Kriti Dutta, Anna Grigoryan, Sachin Tiwari, Stefano Pasquali, Dhagash Mehta

TL;DR

Problem: Align LLM prompts and evaluation signals with human annotations for hallucination detection. Approach: A comparative study of five DSPy teleprompter algorithms (COPRO, MIPRO, BootstrapFewShot, BootstrapFewShot with Optuna, KNN Few Shot) using the HaluBench benchmark. Contributions: Empirical evidence that optimized prompts can surpass baselines in detecting hallucinations, with dataset-dependent gains and insights into when different teleprompters excel. Significance: Demonstrates DSPy as a fast, cost-effective pathway to better task alignment without updating model weights, and suggests directions for reducing data-source bias and combining prompt optimization with instruction tuning.

Abstract

We argue that the Declarative Self-improving Python (DSPy) optimizers offer a way to align large language model (LLM) prompts and their evaluations with human annotations. We present a comparative analysis of five teleprompter algorithms, namely Cooperative Prompt Optimization (COPRO), Multi-Stage Instruction Prompt Optimization (MIPRO), BootstrapFewShot, BootstrapFewShot with Optuna, and K-Nearest Neighbor Few Shot, within the DSPy framework with respect to their ability to align with human evaluations. As a concrete example, we focus on optimizing the prompt so that hallucination detection (using an LLM as a judge) agrees with human-annotated ground-truth labels on a publicly available benchmark dataset. Our experiments demonstrate that optimized prompts can outperform various benchmark methods at detecting hallucination, and that certain teleprompters outperform the others, at least in these experiments.
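To make "aligning with human annotations" concrete, the following is a minimal sketch of how agreement between a judge and human labels might be scored and used to pick among candidate prompts. It deliberately avoids the DSPy API: the PASS/FAIL label strings, the `toy_judge` and `always_pass` functions (stand-ins for LLM calls), and the tiny `DEV_SET` are all illustrative assumptions, not the paper's setup or data.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: (context, question, answer, human_label).
# The "PASS"/"FAIL" strings are an assumed labeling convention.
Example = Tuple[str, str, str, str]
Judge = Callable[[str, str, str], str]


def agreement_rate(judge: Judge, examples: List[Example]) -> float:
    """Fraction of examples where the judge's verdict equals the human label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    hits = sum(1 for ctx, q, a, label in examples if judge(ctx, q, a) == label)
    return hits / len(examples)


def select_best(candidates: Dict[str, Judge], examples: List[Example]) -> Tuple[str, float]:
    """Return the name and score of the candidate that agrees most with humans."""
    scores = {name: agreement_rate(j, examples) for name, j in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def toy_judge(context: str, question: str, answer: str) -> str:
    # Stand-in for an LLM judge: passes only if the answer appears in the context.
    return "PASS" if answer.lower() in context.lower() else "FAIL"


def always_pass(context: str, question: str, answer: str) -> str:
    # Trivial baseline that never flags a hallucination.
    return "PASS"


# Made-up development set, for illustration only.
DEV_SET: List[Example] = [
    ("Paris is the capital of France.", "Capital of France?", "Paris", "PASS"),
    ("Paris is the capital of France.", "Capital of France?", "Lyon", "FAIL"),
    ("The Nile flows through Egypt.", "Where does the Nile flow?", "Sudan", "PASS"),
    ("Water boils at 100 C at sea level.", "Boiling point?", "50 C", "FAIL"),
]

best_name, best_score = select_best(
    {"toy": toy_judge, "always_pass": always_pass}, DEV_SET
)
```

In this sketch each candidate plays the role of one optimized prompt, and the selection step keeps whichever one agrees most often with the human labels on a held-out set. A real study would replace the toy functions with LLM calls and would typically report further metrics beyond raw agreement.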

Paper Structure

This paper contains 23 sections, 10 tables.