Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Expected Information Gain

Davide Mazzaccara, Alberto Testoni, Raffaella Bernardi

TL;DR

This paper proposes a method to enhance the informativeness of LLM-generated questions in 20-question game dialogues by pairing low-EIG and high-EIG questions sampled from the same model and applying Direct Preference Optimization (DPO).

Abstract

Questions are essential tools for acquiring the necessary information to complete information-seeking tasks. However, large language models (LLMs), especially open-source models, often perform poorly in generating informative questions, as measured by expected information gain (EIG). In this paper, we propose a method to enhance the informativeness of LLM-generated questions in 20-question game dialogues. We sample multiple questions from the same model (LLAMA 2-CHAT 7B) for each game and create pairs of low-EIG and high-EIG questions to apply a Direct Preference Optimization (DPO) algorithm. Our results show that this method produces more effective questions (in terms of EIG), even in domains different from those used to train the DPO model.
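
As a rough illustration (not code from the paper), under a uniform prior over the remaining candidates the EIG of a yes/no question reduces to the binary entropy of the answer split it induces, peaking at 1 bit for a 50/50 split. A minimal sketch, with a helper name of our own choosing:

```python
import math

def expected_information_gain(answers):
    """EIG (in bits) of a yes/no question, given the expected answer
    ('yes'/'no') for each remaining candidate. Assuming a uniform prior
    over candidates, EIG equals the binary entropy of the yes/no split:
    1 bit for a perfect 50/50 split, 0 when all answers agree."""
    p_yes = sum(a == "yes" for a in answers) / len(answers)
    return sum(-p * math.log2(p) for p in (p_yes, 1 - p_yes) if p > 0)

# A question splitting 8 candidates 4/4 is optimal (EIG = 1);
# one that every candidate answers 'yes' to is uninformative (EIG = 0).
print(expected_information_gain(["yes"] * 4 + ["no"] * 4))  # 1.0
print(expected_information_gain(["yes"] * 8))               # 0.0
```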

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: The proposed approach for constructing the dialogue datasets for fine-tuning (FT) and preference optimization (DPO). Given the original candidate set, the Questioner generates a question Q$_1$, and the Annotator provides the expected answer for each candidate. Expected Information Gain (EIG) is computed from these annotations: if the question is suboptimal in terms of EIG, further questions are sampled until an optimal one is reached (Q$_n$). The optimal question is paired with each suboptimal one in the preference (DPO) dataset, whereas the FT dataset is composed only of 1-EIG questions (see the sketch after this list).
  • Figure 2: Example from BigBench. DPO asks grounded constraint-seeking (CS) questions (highlighted with colors), identifying the subset containing the target (e.g., emotions). It then asks a series of hypothesis-scanning (HS) questions.
  • Figure 3: During sampling, Llama 2-chat (7B) plays the roles of the Questioner, the Annotator, and the Answerer. Questions are sampled from the Questioner and evaluated by the Annotator. Once an optimal question is reached, the Answerer answers it, and the question–answer pair is appended to the dialogue history. In this way, optimal questions are sampled not only for the first turn but also for follow-up turns. During training, the Questioner is trained on the FT and DPO datasets. At test time, the zero-shot and trained Questioners play the 20 Questions Game with an external model as the Answerer.
  • Figure 4: Example from the INLG dataset (8 candidates): half of the candidates are birds, and half are mammals. Both the zero-shot and DPO models identify the bird category at the first turn ($EIG=1$). After the negative answer, the zero-shot model asks a confirmation question about the remaining category (i.e., mammals), with $EIG=0$. The DPO model, instead, asks a more specific question about the remaining category (i.e., large hoofed animals: 'elk' and 'buffalo'), with $EIG=1$. The higher informativeness of DPO is also reflected in the number of questions needed to reach the target: 3 for DPO vs. 4 for zero-shot.
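
Putting Figures 1 and 3 together, the data-construction loop can be sketched as follows. This is our paraphrase, not the authors' code: `questioner.ask`, `annotator.answer`, and the game objects are hypothetical stand-ins for the paper's Llama 2-chat (7B) prompt wrappers, and `max_samples` is an assumed resampling budget; the EIG = 1 stopping criterion mirrors the Figure 1 caption. It reuses `expected_information_gain` from the sketch above.

```python
def build_datasets(games, questioner, annotator, max_samples=10):
    """Hypothetical sketch of the Figure 1 pipeline. For each game state,
    questions are sampled until one reaches EIG = 1; that question goes
    into the fine-tuning (FT) set and is paired with every suboptimal
    sample as a DPO preference pair (chosen = high-EIG, rejected = low-EIG)."""
    ft_data, dpo_data = [], []
    for game in games:
        suboptimal = []
        for _ in range(max_samples):  # resampling budget (our assumption)
            question = questioner.ask(game.history)
            answers = [annotator.answer(question, c) for c in game.candidates]
            if expected_information_gain(answers) >= 1.0:  # optimal question
                ft_data.append({"prompt": game.history, "completion": question})
                dpo_data += [
                    {"prompt": game.history, "chosen": question, "rejected": bad}
                    for bad in suboptimal
                ]
                break
            suboptimal.append(question)
    return ft_data, dpo_data
```

The resulting preference pairs are then used with the standard DPO objective of Rafailov et al. (2023), which the paper applies (possibly with minor variations):

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $x$ is the dialogue history, $y_w$ the high-EIG (preferred) question, and $y_l$ the low-EIG one. Dictionaries in the prompt/chosen/rejected format above can be passed directly to an off-the-shelf DPO trainer such as TRL's `DPOTrainer`.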