CoPrompter: User-Centric Evaluation of LLM Instruction Alignment for Improved Prompt Engineering

Ishika Joshi, Simra Shahid, Shreeya Venneti, Manushree Vasu, Yantao Zheng, Yunyao Li, Balaji Krishnamurthy, Gromit Yeuk-Yin Chan

TL;DR

CoPrompter helps prompt engineers identify and refine instruction alignment with prompt requirements more effectively than traditional methods, shows them where and how frequently models fail to follow those requirements, and helps them clarify their own requirements, giving them greater control over the response evaluation process.

Abstract

Ensuring that large language models' (LLMs') responses align with prompt instructions is crucial for application development. Our formative study with industry professionals found that achieving this alignment requires heavy human involvement and tedious trial and error, especially when the prompt contains many instructions. To address these challenges, we introduce CoPrompter, a framework that identifies misalignment by assessing multiple LLM responses against evaluation criteria. It proposes a method to generate criteria questions derived directly from prompt requirements, and an interface that turns these questions into a user-editable checklist. Our user study with industry prompt engineers shows that CoPrompter improves their ability to identify and refine instruction alignment with prompt requirements compared with traditional methods, helps them understand where and how frequently models fail to follow the user's prompt requirements, and helps them clarify their own requirements, giving them greater control over the response evaluation process. We also present design lessons that underscore our system's potential to streamline the prompt engineering process.

Paper Structure

This paper contains 47 sections, 4 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: (a) Users enter prompt requirements in the 'Build Your Evaluator' tab. (b) Atomic instructions and criteria questions are extracted for evaluating responses, with (b1) options to modify them before saving. (c) In the 'Analyse Your Prompt' tab, users input a prompt, select a model, and choose the number of responses for evaluation. (d) Responses are evaluated based on the criteria, and alignment scores are displayed in the Alignment Report Card. (e) Users can view detailed scores, generated responses, CoPrompter's score, and rationale, and adjust the prompt accordingly. (f) CoPrompter categorizes alignment by content, style, and instruction type. (g) The prompt can be iteratively improved and retested. (h) Sample responses can also be tested against the evaluation criteria.
  • Figure 2: The CoPrompter system workflow begins with (1) the Evaluation Criteria Generation Module, where user prompt requirements (guidelines) are broken down into atomic instructions, which are then formulated as criteria questions (CQs) with ground truth labels and metadata tags (e.g., main task, subtask, or format-related). Users can adjust these criteria in (2) the Update Criteria Module by editing, deleting, or adding new CQs. Next, in (3) the Prompt Response Generation Module, users input their prompt, select an LLM, and generate responses. The responses are evaluated against the defined criteria, with results displayed in (4) the Alignment Report Card Module. This report shows aligned and misaligned requirements, allowing users to explore misalignments in detail. Based on the feedback, users can refine their prompt or criteria, iterating as needed to improve alignment.
  • Figure 3: This figure showcases the categorization of instructions as 'Subjective' or 'Objective' by CoPrompter. Objective tasks are clearly defined instructions, whereas subjective instructions can be interpreted in multiple ways. CoPrompter also provides the user with different interpretations of subjective instructions, along with a positive and negative example for each.
  • Figure 4: Prompt Guidelines prepared by P4
  • Figure 5: P3's prompt guideline for 'Build Your Evaluator' Tab and prompt draft for 'Analyse Your Prompt' Tab
  • ...and 11 more figures
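
The evaluation loop described in the Figure 2 caption (criteria questions with ground truth labels and metadata tags, applied to several generated responses and summarized per requirement) can be illustrated with a minimal Python sketch. This is not CoPrompter's implementation: the names `CriteriaQuestion`, `alignment_report`, and `toy_judge` are hypothetical, the keyword-based judge stands in for whatever evaluator the system actually uses, and the per-criterion score is assumed here to be the fraction of responses whose judged answer matches the ground truth label.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CriteriaQuestion:
    """One yes/no criteria question derived from an atomic instruction."""
    question: str
    expected: bool  # ground truth label: the answer an aligned response should get
    tag: str        # metadata tag, e.g. "main task", "subtask", "format"


# Assumed judge signature: (question, response) -> yes/no answer.
Judge = Callable[[str, str], bool]


def alignment_report(
    criteria: List[CriteriaQuestion],
    responses: List[str],
    judge: Judge,
) -> Dict[str, float]:
    """Return, per criteria question, the fraction of responses that align with it."""
    if not responses:
        raise ValueError("at least one response is required")
    report: Dict[str, float] = {}
    for cq in criteria:
        aligned = sum(judge(cq.question, r) == cq.expected for r in responses)
        report[cq.question] = aligned / len(responses)
    return report


def toy_judge(question: str, response: str) -> bool:
    """Rule-based stand-in for a real judge, used only to make the sketch runnable."""
    if question == "Is the response under 10 words?":
        return len(response.split()) < 10
    if question == "Does the response mention a price?":
        return "$" in response
    raise ValueError(f"unknown question: {question}")


criteria = [
    CriteriaQuestion("Is the response under 10 words?", True, "format"),
    CriteriaQuestion("Does the response mention a price?", False, "main task"),
]
responses = [
    "A light jacket for spring.",
    "Only $20 today.",
    "This jacket is warm, durable, stylish, and works well in any season.",
    "Buy now for $5.",
]

report = alignment_report(criteria, responses, toy_judge)
for question, score in report.items():
    print(f"{score:.2f}  {question}")
```

Scoring each criterion across several responses, rather than a single one, is what makes it possible to see how frequently a given requirement is missed; a low fraction points to the instruction that most needs rewording in the prompt.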