In-Context Learning with Reinforcement Learning for Incomplete Utterance Rewriting

Haowei Du, Dongyan Zhao

TL;DR

This work targets Incomplete Utterance Rewriting (IUR) within an in-context learning (ICL) framework and introduces a policy-based reinforcement learning approach (RLS) to select demonstrations. An LM selector encodes candidate examples and is trained with policy gradients using ICL rewards from the fixed LLM generator, aligning demonstration quality with rewriting performance. Across CANARD, TASK, and REWRITE, RLS consistently outperforms sparse/dense retrievers and even supervised finetuning baselines in few-shot settings, underscoring the value of direct LLM feedback for demonstration selection. The findings highlight the importance of balancing textual similarity, linguistic complexity, and test-case relevance, and show that larger LLMs can further amplify the benefits of RL-informed demonstration selection.

Abstract

In-context learning (ICL) with large language models (LLMs), in which an LLM makes predictions based only on instructions augmented with a few examples, has attracted increasing attention in the community. Existing example selection methods for ICL use sparse or dense retrievers and achieve strong performance. However, these methods do not use direct feedback from the LLM to train the retriever, so the selected examples do not necessarily improve the LLM's ability to learn by analogy. To tackle this, we propose a policy-based reinforcement learning framework for example selection (RLS), which consists of a language model (LM) selector and an LLM generator. The LM selector encodes the candidate examples into dense representations and selects the top-k examples to form the demonstration for the LLM. The LLM's outputs are used to compute the reward and the policy gradient that optimize the LM selector. We conduct experiments on multiple datasets and significantly outperform existing example selection methods. Moreover, our approach shows advantages over supervised finetuning (SFT) models in the few-shot setting. Further experiments show that balancing the abundance of the examples against their similarity to the test case is important for the LLM's ICL performance.
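
The selector/generator loop described in the abstract can be illustrated with a generic REINFORCE sketch. This is not the paper's implementation: the bilinear scorer, the sequential sampling of k examples without replacement, the moving-average baseline, and `toy_reward` (a hypothetical stand-in for prompting the frozen LLM and scoring its rewrite) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sample_k(scores, k, rng):
    """Sample k distinct candidates one at a time, in proportion to softmax(scores).

    Returns the chosen indices and the gradient of the log-probability of
    that selection with respect to the scores.
    """
    remaining = list(range(len(scores)))
    chosen, grad = [], np.zeros(len(scores))
    for _ in range(k):
        p = softmax(scores[remaining])
        j = int(rng.choice(len(remaining), p=p))
        # d log p_j / d scores over the remaining set = onehot(j) - p
        grad[remaining] -= p
        grad[remaining[j]] += 1.0
        chosen.append(remaining.pop(j))
    return chosen, grad


def toy_reward(chosen, usefulness):
    """Hypothetical reward. In a real system this step would build a prompt
    from the chosen examples, call the frozen LLM, and score its output."""
    return float(np.mean(usefulness[chosen]))


def train_selector(cand_emb, query_emb, usefulness, k=2, steps=200, lr=0.5, seed=0):
    """REINFORCE on an assumed bilinear scorer: score_i = cand_emb[i] @ W @ query_emb."""
    rng = np.random.default_rng(seed)
    d = cand_emb.shape[1]
    W = np.zeros((d, d))
    baseline = 0.0  # moving-average baseline to reduce gradient variance
    for _ in range(steps):
        scores = cand_emb @ W @ query_emb
        chosen, g = sample_k(scores, k, rng)
        r = toy_reward(chosen, usefulness)
        # Chain rule: d score_i / dW = outer(cand_emb[i], query_emb)
        W += lr * (r - baseline) * np.outer(cand_emb.T @ g, query_emb)
        baseline = 0.9 * baseline + 0.1 * r
    return W
```

Only the selector parameters `W` are updated; the generator is treated as a black box that returns a scalar reward, which mirrors the fixed-generator setup the abstract describes.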

Paper Structure

This paper contains 27 sections, 4 equations, 1 figure, 11 tables.

Figures (1)

  • Figure 1: Method Overview. Our approach consists of an LM example selector and an LLM generator.