Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective
Victor Gallego
TL;DR
This work reframes RL-based alignment of LLMs as Bayesian inference by introducing distilled Self-Critique (dSC), which refines outputs through a Gibbs-sampler chain of critique and revision steps and then distills the results into a fine-tuned model using only synthetic data. The posterior $p(x|c,r) \propto p(x|c) \exp(r(x))$ is the target of the sampling, and a Metropolis acceptance step keeps high-reward revisions; a LoRA-based distillation then minimizes $\mathbb{E}_{p(x|c,r)} [\log p(x|c,r) - \log p_{\theta}(x|c)]$ to amortize the sampling process into the model parameters $\theta$. Experiments on safety, sentiment, and privacy show that dSC, especially with the acceptance step, yields superior safety scores and beneficial downstream transfer while relying only on synthetic data and publicly available reward models. The work offers a scalable, cheaper alternative to traditional RLHF/RLAIF, with practical implications for regulated and private deployments of LLMs. Future directions include exploring alternative divergences and zero-shot reward models to further improve alignment.
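The critique-revise-accept loop summarized above can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration: `reward`, `critique`, and `revise` are hypothetical string-based stand-ins for the reward model and LLM calls, and the acceptance rule $\min(1, \exp(r(x') - r(x)))$ is a simplified Metropolis-style rule that ignores proposal ratios, which may differ from the rule actually used in the paper.

```python
import math
import random


def reward(text: str) -> float:
    """Hypothetical reward r(x): +1 per 'please', -1 per 'stupid'."""
    words = text.lower().split()
    return float(words.count("please") - words.count("stupid"))


def critique(text: str) -> str:
    """Hypothetical critique step (stand-in for an LLM critique call)."""
    if "stupid" in text.lower().split():
        return "remove 'stupid'"
    return "add 'please'"


def revise(text: str, crit: str) -> str:
    """Hypothetical revision step (stand-in for an LLM revision call)."""
    if crit == "remove 'stupid'":
        return " ".join(w for w in text.split() if w.lower() != "stupid")
    return text + " please"


def self_critique_chain(x0: str, n_steps: int = 3, rng=None) -> str:
    """Alternate critique and revision, accepting each revision with
    probability min(1, exp(r(x') - r(x))) (simplified, assumed rule)."""
    rng = rng or random.Random(0)
    x = x0
    for _ in range(n_steps):
        crit = critique(x)
        proposal = revise(x, crit)
        log_alpha = reward(proposal) - reward(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = proposal
    return x


def build_distillation_set(prompts_and_drafts, n_steps: int = 3):
    """Pair each prompt c with the final sample of its chain; such pairs
    would form the synthetic dataset for the distillation stage."""
    return [(c, self_critique_chain(x0, n_steps)) for c, x0 in prompts_and_drafts]
```

In an actual pipeline, the pairs returned by `build_distillation_set` would serve as supervised fine-tuning data: maximizing $\log p_{\theta}(x|c)$ on samples from the chain is a Monte Carlo estimate of minimizing the expectation in the TL;DR, since the $\log p(x|c,r)$ term does not depend on $\theta$.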
Abstract
This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of an LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Requiring only synthetic data, dSC is evaluated in experiments on safety, sentiment, and privacy control, showing that it can be a viable and cheap alternative for aligning LLMs. Code is released at \url{https://github.com/vicgalle/distilled-self-critique}.
