A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation

Bhaskarjit Sarmah; Kriti Dutta; Anna Grigoryan; Sachin Tiwari; Stefano Pasquali; Dhagash Mehta

TL;DR

Problem: Align LLM prompts and evaluation signals with human annotations for hallucination detection. Approach: A comparative study of five DSPy teleprompter algorithms (COPRO, MIPRO, BootstrapFewShot, BootstrapFewShot with Optuna, KNN Few Shot) on the HaluBench benchmark. Contributions: Empirical evidence that optimized prompts can surpass baselines in detecting hallucinations, with dataset-dependent gains, and insights into when different teleprompters excel. Significance: Demonstrates DSPy as a fast, cost-effective pathway to better task alignment without updating model weights, and suggests directions for reducing data-source bias and combining prompt optimization with instruction tuning.

Abstract

We argue that Declarative Self-improving Python (DSPy) optimizers offer a way to align large language model (LLM) prompts and their evaluations with human annotations. We present a comparative analysis of five teleprompter algorithms within the DSPy framework, namely Cooperative Prompt Optimization (COPRO), Multi-Stage Instruction Prompt Optimization (MIPRO), BootstrapFewShot, BootstrapFewShot with Optuna, and K-Nearest Neighbor Few Shot, with respect to their ability to align with human evaluations. As a concrete example, we focus on optimizing the prompt to align hallucination detection (using an LLM as a judge) with human-annotated ground-truth labels for a publicly available benchmark dataset. Our experiments demonstrate that optimized prompts can outperform various benchmark methods at detecting hallucinations, and that certain teleprompters outperform the others, at least in these experiments.
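
To make the notion of alignment concrete, the following is a minimal sketch of the kind of agreement metric a teleprompter could be asked to maximize. It is an illustration under stated assumptions, not the implementation used in this study: the `Example` and `Prediction` classes, their field names, the "PASS"/"FAIL" label vocabulary, and the `always_pass_judge` stand-in are all hypothetical. Only the `metric(example, pred, trace=None)` call shape follows the DSPy metric convention.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    """One human-annotated item (field names are illustrative assumptions)."""
    context: str
    question: str
    answer: str
    label: str  # human label, assumed here to be "PASS" or "FAIL"


@dataclass
class Prediction:
    """Output of the LLM judge for one item (hypothetical container)."""
    label: str


def agreement_metric(example: Example, pred: Prediction, trace=None) -> bool:
    """Return True when the judge's label matches the human label."""
    return example.label.strip().upper() == pred.label.strip().upper()


def alignment_score(
    examples: Sequence[Example],
    judge: Callable[[Example], Prediction],
) -> float:
    """Fraction of examples on which the judge agrees with the human label."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(agreement_metric(ex, judge(ex)) for ex in examples)
    return hits / len(examples)


def always_pass_judge(example: Example) -> Prediction:
    """Trivial stand-in for an LLM judge: it never flags a hallucination."""
    return Prediction(label="PASS")


# Tiny made-up development set: two faithful answers, two hallucinated ones.
devset = [
    Example("Paris is the capital of France.", "Capital of France?", "Paris", "PASS"),
    Example("Paris is the capital of France.", "Capital of France?", "Lyon", "FAIL"),
    Example("Water boils at 100 C at sea level.", "Boiling point?", "100 C", "PASS"),
    Example("Water boils at 100 C at sea level.", "Boiling point?", "50 C", "FAIL"),
]

score = alignment_score(devset, always_pass_judge)
```

On this toy set the trivial judge agrees with the human labels on half of the items, so `score` is 0.5. A prompt optimizer would search over instructions and few-shot demonstrations for the judge so that this score rises on held-out annotated data.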
