Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective

Victor Gallego

TL;DR

This work reframes RL-based alignment of LLMs as Bayesian inference by introducing distilled Self-Critique (dSC), which refines outputs through a Gibbs-sampler chain (critique and revision) and then distills the results into a fine-tuned model using only synthetic data. The posterior $p(x|c,r) \propto p(x|c) \exp(r(x))$ guides the sampling, while a Metropolis step ensures high-reward revisions; a LoRA-based distillation minimizes $\mathbb{E}_{p(x|c,r)} [\log p(x|c,r) - \log p_{\theta}(x|c)]$ to amortize the process. Experiments across safety, sentiment, and privacy show that dSC, especially with the acceptance step, yields superior safety scores and beneficial downstream transfer, using synthetic data and publicly available reward models. The work provides a scalable, cheaper alternative to traditional RLHF/RLAIF, with broad practical implications for regulated and private deployment of LLMs. Future directions include exploring alternative divergences and zero-shot reward models to further enhance alignment.
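The critique, revision, and acceptance loop described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names (`self_critique_chain`, `build_dataset`, and the `toy_*` stand-ins) are hypothetical, the language model and reward model are replaced by simple string functions, and the acceptance rule $\min(1, \exp(r(x') - r(x)))$ assumes the proposal terms cancel, which may differ from the rule used in the paper.

```python
import math
import random
from typing import Callable, List, Optional, Tuple


def self_critique_chain(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    reward: Callable[[str], float],
    n_steps: int = 1,
    rng: Optional[random.Random] = None,
) -> str:
    """Run critique -> revision steps, accepting each revision with a
    Metropolis-style test on the reward difference (an assumed rule)."""
    rng = rng or random.Random(0)
    x = generate(prompt)  # initial draft, standing in for a sample from p(x|c)
    for _ in range(n_steps):
        critique_text = critique(prompt, x)
        x_new = revise(prompt, x, critique_text)
        log_ratio = reward(x_new) - reward(x)
        # Accept with probability min(1, exp(r(x_new) - r(x))).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = x_new
    return x


def build_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    reward: Callable[[str], float],
) -> List[Tuple[str, str]]:
    """Collect (prompt, refined response) pairs as synthetic fine-tuning data."""
    return [
        (p, self_critique_chain(p, generate, critique, revise, reward))
        for p in prompts
    ]


# Toy stand-ins for the language model and the reward model (hypothetical).
def toy_generate(prompt: str) -> str:
    return f"rude answer to {prompt}"


def toy_critique(prompt: str, response: str) -> str:
    return "The response is rude." if "rude" in response else "No issues."


def toy_revise(prompt: str, response: str, critique_text: str) -> str:
    return response.replace("rude", "polite")


def toy_reward(text: str) -> float:
    return 1.0 if "polite" in text else -1.0


if __name__ == "__main__":
    pairs = build_dataset(
        ["hi", "hello"], toy_generate, toy_critique, toy_revise, toy_reward
    )
    for prompt, response in pairs:
        print(prompt, "->", response)
```

In a real setting the three callables would wrap prompts to the same LLM, and the resulting pairs would serve as the synthetic dataset for the distillation stage.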

Abstract

This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of an LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Requiring only synthetic data, dSC is evaluated in experiments on safety, sentiment, and privacy control, showing that it can be a viable and cheap alternative for aligning LLMs. Code is released at https://github.com/vicgalle/distilled-self-critique.
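The distillation objective $\mathbb{E}_{p(x|c,r)} [\log p(x|c,r) - \log p_{\theta}(x|c)]$ from the TL;DR can be read as ordinary supervised fine-tuning on the sampler's outputs. The short derivation below is a sketch under one assumption: $x_1, \dots, x_N$ (notation introduced here) are samples drawn, at least approximately, from the posterior $p(x|c,r)$ by the critique and revision chain.

$$
\begin{aligned}
\theta^{*} &= \arg\min_{\theta} \; \mathbb{E}_{p(x|c,r)} \left[ \log p(x|c,r) - \log p_{\theta}(x|c) \right] \\
&= \arg\max_{\theta} \; \mathbb{E}_{p(x|c,r)} \left[ \log p_{\theta}(x|c) \right] \\
&\approx \arg\max_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_i|c), \qquad x_i \sim p(x|c,r).
\end{aligned}
$$

The second line holds because $\mathbb{E}_{p(x|c,r)}[\log p(x|c,r)]$ does not depend on $\theta$; the third replaces the expectation with a Monte Carlo average. Under this reading, minimizing the objective amounts to maximizing the log-likelihood of the refined samples, which needs only the synthetic data and no further reward-model calls during training.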

Paper Structure (10 sections, 3 figures, 7 tables)

Introduction and Related Work
Distilled Self-Critique: a Bayesian Perspective
Experiments
Conclusions
Comparison between Frameworks
Experiment Details and Additional Results
Avoiding harmful behaviors.
Avoiding negative sentiment.
Privacy-preserving generation.
Generated Samples

Figures (3)

Figure 1: Illustrative diagram of the dSC framework.
Figure 2: Experiment results for sentiment task.
Figure 3: Experiment results for the privacy-preserving generation task.
