L3Ms -- Lagrange Large Language Models
Guneet S. Dhillon, Xingjian Shi, Yee Whye Teh, Alex Smola
TL;DR
L3Ms reframe SFT and alignment as a single constrained optimization problem, enabling application-specific guarantees on preference properties through average constraints instead of heuristic reward weighting. By incorporating a relaxed logarithmic barrier, the method gradually enforces the constraints during fine-tuning, linking barrier gradients to Lagrange multipliers and avoiding deviations from an SFT anchor. The approach demonstrates that length-constrained and safety-oriented preferences (e.g., Helpful/Harmless) can be satisfied without sacrificing task performance, and with improved efficiency relative to saddle-point methods that rely on separate SFT models. Empirical results on instruction-following with UltraChat data show that L3Ms can tailor responses (e.g., concise vs. verbose) while maintaining competitive perplexities, highlighting the practical impact of principled, constraint-driven customization for LLM deployment.
Abstract
Supervised fine-tuning (SFT) and alignment of large language models (LLMs) are key steps in providing a good user experience. However, the concept of an appropriate alignment is inherently application-dependent, and current methods often rely on heuristic choices to drive optimization. In this work, we formulate SFT and alignment as a constrained optimization problem: the LLM is fine-tuned on a task while being required to meet application-specific requirements, without resorting to heuristics. To solve this, we propose Lagrange Large Language Models (L3Ms), which employ logarithmic barriers to enforce the constraints. This approach allows for the customization of L3Ms across diverse applications while avoiding heuristic-driven processes. We experimentally demonstrate the versatility and efficacy of L3Ms in achieving tailored alignments for various applications.
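The summary above mentions a relaxed logarithmic barrier whose gradient plays the role of a Lagrange multiplier, but it does not give the formula. The following is only an illustrative sketch of one common form, not necessarily the exact one used by the authors. It assumes a constraint written as z <= 0 and a barrier that is extended linearly once z comes within a margin s of the boundary. The names `relaxed_log_barrier`, `barrier_slope`, `mu`, and `s` are hypothetical.

```python
import math


def relaxed_log_barrier(z: float, mu: float, s: float) -> float:
    """Penalty for a constraint written as z <= 0 (assumed convention).

    While z <= -s the constraint is comfortably satisfied and the exact
    log barrier -mu * log(-z) is used. Past that point the barrier is
    continued by a straight line that matches the value and the slope at
    z = -s, so the penalty stays finite even if the constraint is violated.
    """
    if z <= -s:
        return -mu * math.log(-z)
    return (mu / s) * (z + s) - mu * math.log(s)


def barrier_slope(z: float, mu: float, s: float) -> float:
    """Derivative of the relaxed barrier with respect to z.

    This slope weights the constraint gradient during training, which is
    the sense in which it behaves like a Lagrange multiplier: small when
    the constraint has slack, capped at mu / s near or past the boundary.
    """
    return mu / max(-z, s)
```

Under these assumptions, a constraint with plenty of slack (z far below zero) contributes almost nothing to the update, while one that is close to or past its limit is weighted by at most `mu / s`.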
