Self-Refinement of Language Models from External Proxy Metrics Feedback

Keshav Ramji; Young-Suk Lee; Ramón Fernandez Astudillo; Md Arafat Sultan; Tahira Naseem; Asim Munawar; Radu Florian; Salim Roukos

TL;DR

Proxy Metric-based Self-Refinement (ProMiSe) is introduced, which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response.

Abstract

It is often desirable for Large Language Models (LLMs) to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics, and iteratively refines its response one principle at a time. We apply ProMiSe to the open-source language models Flan-T5-XXL and Llama-2-13B-Chat and evaluate its performance on the document-grounded question answering datasets MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over both the zero-shot baseline and a model fine-tuned with supervision on human-annotated data.
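
The loop described in the abstract (generate an initial response, then refine it one principle at a time under external metric feedback) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `self_refine`, `token_recall`, and `toy_model`, the prompt wording, the `max_rounds` parameter, and the rule "keep a refinement only if the proxy metric strictly improves" are all assumptions made for the example. The toy metric is a crude stand-in for a real proxy metric such as ROUGE or WeCheck.

```python
from typing import Callable, Dict

# Hypothetical interfaces, assumed for this sketch only:
#   Model:  prompt -> response text
#   Metric: (response, document) -> score, where higher is better
Model = Callable[[str], str]
Metric = Callable[[str, str], float]


def token_recall(response: str, document: str) -> float:
    """Toy proxy metric: fraction of document tokens that the response covers."""
    doc_tokens = set(document.lower().split())
    resp_tokens = set(response.lower().split())
    if not doc_tokens:
        return 0.0
    return len(doc_tokens & resp_tokens) / len(doc_tokens)


def self_refine(model: Model, document: str, query: str,
                principles: Dict[str, Metric], max_rounds: int = 2) -> str:
    """Refine one principle at a time, using that principle's proxy metric."""
    # Initial response from the same model that later performs refinement.
    response = model(f"Document: {document}\nQuery: {query}\nAnswer:")
    for name, metric in principles.items():
        best = metric(response, document)
        for _ in range(max_rounds):
            prompt = (f"Document: {document}\nQuery: {query}\n"
                      f"Response: {response}\n"
                      f"Improve the response with respect to {name}:")
            candidate = model(prompt)
            score = metric(candidate, document)
            if score <= best:
                # Assumed acceptance rule: stop when the metric does not improve.
                break
            response, best = candidate, score
    return response


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM, used only to exercise the loop."""
    if "Improve the response" in prompt:
        return "you can renew your license online or by mail"
    return "renew online"


if __name__ == "__main__":
    doc = "you can renew your license online or by mail"
    print(self_refine(toy_model, doc, "how do I renew?",
                      {"specificity": token_recall}))
```

In this sketch the external metric, not the model, decides whether a refinement is kept, so a candidate that does not score higher leaves the current response unchanged.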

Paper Structure (37 sections, 1 equation, 7 figures, 6 tables, 2 algorithms)

Introduction
Algorithm
Initial Response Generation
Response Refinement
Determining Improvement
Evidence: Question Answering
Set of Principles
In-Context Demonstration Selection
Response and Query Generation.
Principle Refinement.
External Proxy Metrics
ROUGE Metrics.
WeCheck: Factual Consistency Checker.
Experimental Results and Discussion
Evaluation Datasets.
...and 22 more sections

Figures (7)

Figure 1: A high-level overview of our proposed self-refinement algorithm for content-grounded question answering, with both initial response generation and iterative refinement performed with the same Large Language Model $\mathcal{M}$.
Figure 2: GPT-4-as-a-Judge results on Flan-T5-XXL for MultiDoc2Dial (MD2D) and QuAC. With 2551 randomly sampled instances from the MultiDoc2Dial test set, we examine those for which the initial and final response differ: 495 samples for ROUGE-only, 131 samples for RM-only (WeCheck), and 504 samples for ROUGE + RM. We perform a similar analysis with all 1000 QuAC test set instances; the respective counts are: 193 samples for ROUGE-only, 65 samples for RM-only, and 224 samples for ROUGE + RM.
Figure 3: Above is the initial generation prompt, containing the instruction and the three in-context exemplars drawn from the train set of MultiDoc2Dial, omitting the current sample inputs (document and context). The exemplars demonstrate question answering given the conversation history, and are separated by "###".
Figure 4: Query generation prompt ($q$ in the algorithm of Appendix A), containing an instruction and three in-context demonstrations of user queries given a document (separated by "###"), omitting the current instance inputs.
Figure 5: Refinement prompt ($r_p$) with respect to the specificity principle, with an instruction and three in-context demonstrations. Our instruction explicitly asks the model to improve specificity and states that, in the provided in-context exemplars, the latter response (Agent response 2) is more specific than the former response (Agent response 1). Notably, we show the model the utterance "Let's make this response more specific" between the worse and better responses. Each exemplar and its "more specific" (gold) response are derived from the MultiDoc2Dial train set, while the "not specific" response is written by a human annotator, bootstrapping off the gold response. The three exemplars are separated by "###", and the above prompt omits the current instance inputs.
...and 2 more figures
