LLMs can learn self-restraint through iterative self-reflection

Alexandre Piché, Aristides Milios, Dzmitry Bahdanau, Chris Pal

TL;DR

The paper introduces ReSearch, an iterative self-reflection framework that uses a utility function to train LLMs to modulate their outputs and abstain when uncertain. By generating synthetic data through self-evaluation and self-prompting, and by optimizing a scalar utility that rewards true claims and penalizes false ones, models learn self-restraint with no external references. Empirical results across biographies and historical-event tasks show reduced hallucinations and favorable Pareto fronts compared to baselines, with strong performance on the FActScore benchmark. The approach offers a principled way to calibrate completeness, accuracy, and abstention, enabling safer and more controllable LLM behavior in closed-book settings, while highlighting trade-offs between computation, detail, and veracity. Future work may integrate retrieval augmentation and explore broader behavioral controls beyond abstention.

Abstract

In order to be deployed safely, Large Language Models (LLMs) must be capable of dynamically adapting their behavior based on their level of knowledge and uncertainty associated with specific topics. This adaptive behavior, which we refer to as self-restraint, is non-trivial to teach since it depends on the internal knowledge of an LLM. By default, LLMs are trained to maximize the next token likelihood, which does not teach the model to modulate its answer based on its level of uncertainty. In order to learn self-restraint, we devise a utility function that can encourage the model to produce responses only when it is confident in them. This utility function can be used to score generations of different lengths as well as abstention. To optimize this function, we introduce ReSearch, a process of "self-reflection" consisting of iterative self-prompting and self-evaluation. We use the ReSearch algorithm to generate synthetic data on which we finetune our models. Compared to their original versions, our resulting models generate fewer hallucinations overall at no additional inference cost, for both known and unknown topics, as the model learns to selectively restrain itself. In addition, our method elegantly incorporates the ability to abstain by augmenting the samples generated by the model during the search procedure with an answer expressing abstention.
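
A minimal sketch of how such a utility could rank generations of different lengths against abstention. The abstract does not give the functional form, so the version below is an assumption: each true claim adds 1 and each false claim subtracts a penalty $\lambda = \rho^*/(1-\rho^*)$, which makes a response break even exactly at the target accuracy $\rho^*$. The names `utility`, `best_response`, and `ABSTAIN`, and the abstention wording, are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical abstention phrase; the source does not specify its wording.
ABSTAIN = "I do not know enough about this topic to answer."


def utility(claim_probs: Sequence[float], target_accuracy: float) -> float:
    """Expected utility of a response given per-claim truth probabilities.

    Assumed form (not taken from the source): a claim with truth
    probability p contributes p - lam * (1 - p), lam = rho / (1 - rho).
    """
    if not 0.0 <= target_accuracy < 1.0:
        raise ValueError("target_accuracy must be in [0, 1)")
    lam = target_accuracy / (1.0 - target_accuracy)
    return sum((p - lam * (1.0 - p) for p in claim_probs), 0.0)


def best_response(
    candidates: List[Tuple[str, List[float]]], target_accuracy: float
) -> Tuple[float, str]:
    """Pick the highest-utility candidate, with abstention as a fallback."""
    scored = [(utility(probs, target_accuracy), text) for text, probs in candidates]
    # Abstention makes no claims, so its utility is 0 by construction.
    scored.append((0.0, ABSTAIN))
    return max(scored, key=lambda pair: pair[0])
```

Under this assumed form, raising `target_accuracy` raises the penalty, so a long response with one doubtful claim can lose to a shorter one, and a response made mostly of guesses loses to abstention.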

Paper Structure (19 sections, 5 equations, 11 figures, 11 tables, 1 algorithm)

Figures (11)

  • Figure 1: Llama2 7B models trained on data synthetically generated by the ReSearch algorithm ($\star$) outperform all baselines. These include Llama2 7B chat ($\blacksquare$) and safety-prompted ($\bullet$), Llama2 70B chat ($\CIRCLE$), and search DPO tian2023fine ($\blacktriangle$). Our model produces more claims than search DPO and Llama2 70B and is at least as accurate. Mistral 7B models trained on data synthetically generated by the ReSearch algorithm ($\star$) also outperform all baselines. These include Mistral 7B ($\blacksquare$) and safety-prompted ($\bullet$), Mixtral 8x7B ($\CIRCLE$), and FactTune tian2023fine ($\blacktriangle$). Our model is more accurate than every baseline and produces at least as many claims as search DPO.
  • Figure 2: Overview of the ReSearch algorithm. ReSearch combines two components: 1) Self-Evaluation, where the model estimates the expected accuracy $\hat{\rho}$ of its generated claims from their self-consistency with all the generations it has produced, and 2) Self-Prompting, where the model incorporates the claims whose estimated accuracy exceeds $\rho^*$ into its prompt to improve its generations at the next iteration. Finally, the generations produced by the ReSearch algorithm, plus the phrase expressing abstention, are self-evaluated, ranked, and returned as synthetic data that can be used as desired.
  • Figure 3: Utility function contour plots. In sub-figure a), we observe that using the average accuracy as a utility, as several methods such as tian2023fine do, does not encourage more claims. In sub-figures b), c) and d), we observe that the utility function encourages the agent to produce as many claims as possible and to maximize the accuracy. It also elegantly ranks samples with different numbers of claims and accuracies. Furthermore, in sub-figure b), we observe that the agent should abstain from a query and obtain a utility of 0 instead of a negative utility if it believes that less than the target accuracy $\rho^*=20\%$ of the claims in its best sample are true. We observe a similar pattern for c) and d), where the target accuracy $\rho^*$ is set to 50% and 80% respectively. The advantages of the utility function ${\mathcal{U}}$ over the average probability are also shown empirically in the reward ablation figure.
  • Figure 4: Pareto fronts. By varying $\rho^*$ we obtain agents with different behaviors in terms of number of claims and accuracy: lower $\rho^*$ results in agents producing more claims but with lower accuracy, while higher $\rho^*$ results in agents producing fewer claims but with higher accuracy. Overall, the ReSearch agents are on a higher Pareto front than the prompted baselines. Interestingly, Llama2 is not well calibrated for target accuracy $\rho^*$ above 0.7.
  • Figure 5: Utility ($\lambda(\rho=0.5)$) as a function of popularity tiers. For the bottom, middle and top tiers, we want the LLMs to have large positive utility (dark blue). Overall, we observe that for the top tier all the models produce more true claims than false claims and obtain a high utility. However, for the bottom tier, the prompted models and some baseline models obtain a negative utility, meaning that they generate more false claims than true ones. Our trained models, on the other hand, achieve positive utility for all tiers and all datasets.
  • ...and 6 more figures
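
The loop described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical sampler that returns a response as a list of claims, self-evaluation is replaced by a simple stand-in that counts exact string matches across generations, and the scoring uses an assumed penalty $\lambda = \rho^*/(1-\rho^*)$ per false claim.

```python
from typing import Callable, List, Sequence, Tuple

Generation = List[str]  # a response, represented as a list of atomic claims


def self_evaluate(claim: str, generations: Sequence[Generation]) -> float:
    """Stand-in for self-evaluation: share of generations containing the claim."""
    if not generations:
        return 0.0
    return sum(claim in g for g in generations) / len(generations)


def research(
    query: str,
    generate: Callable[[str, List[str]], Generation],  # hypothetical sampler
    target_accuracy: float,
    n_iterations: int = 2,
    n_samples: int = 4,
) -> List[Tuple[float, Generation]]:
    """Return all generations plus abstention, ranked by assumed utility."""
    lam = target_accuracy / (1.0 - target_accuracy)  # assumed false-claim penalty
    pool: List[Generation] = []
    trusted: List[str] = []
    for _ in range(n_iterations):
        # Self-prompting: condition new samples on the claims trusted so far.
        pool.extend(generate(query, trusted) for _ in range(n_samples))
        # Self-evaluation: keep claims whose estimated accuracy exceeds the target.
        claims = {c for g in pool for c in g}
        trusted = sorted(c for c in claims if self_evaluate(c, pool) > target_accuracy)

    def score(generation: Generation) -> float:
        probs = [self_evaluate(c, pool) for c in generation]
        return sum((p - lam * (1.0 - p) for p in probs), 0.0)

    candidates = pool + [[]]  # the empty generation stands for abstention
    return sorted(
        ((score(g), g) for g in candidates), key=lambda pair: pair[0], reverse=True
    )
```

The ranked list plays the role of the synthetic data: the top entry could serve as a finetuning target, and abstention ranks first whenever every sampled generation scores below zero.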