Flexible Realignment of Language Models

Wenhong Zhu, Ruobing Xie, Weinan Zhang, Rui Wang

TL;DR

This work proposes a flexible realignment framework for large language models that gives quantitative control over the degree of alignment during both training and inference. Training-time Realignment (TrRa) realigns a reference model using a controllable fusion of logits from the reference model and an already aligned model. Inference-time Realignment (InRa) adds a lightweight layer adapter so that alignment can be adjusted at inference through logit interpolation with a tunable parameter $\lambda$. The approach reduces token usage substantially (e.g., up to $54.63\%$ with TrRa-iter) while preserving or improving performance on reasoning benchmarks, and it allows a smooth balance between fast and slow thinking on reasoning and dialogue tasks. Because realignment is flexible, parameter-efficient, and deployable, the framework offers a practical way to maintain robustness and user-specific alignment, with potential benefits for cost, safety, and user experience in real-world deployments.

Abstract

Realignment becomes necessary when a language model (LM) fails to meet expected performance. We propose a flexible realignment framework that supports quantitative control of alignment degree during training and inference. This framework incorporates Training-time Realignment (TrRa), which efficiently realigns the reference model by leveraging the controllable fusion of logits from both the reference and already aligned models. For example, TrRa reduces token usage by 54.63% on DeepSeek-R1-Distill-Qwen-1.5B without any performance degradation, outperforming DeepScaleR-1.5B's 33.86%. To complement TrRa during inference, we introduce a layer adapter that enables smooth Inference-time Realignment (InRa). This adapter is initialized to perform an identity transformation at the bottom layer and is inserted preceding the original layers. During inference, input embeddings are simultaneously processed by the adapter and the original layer, followed by the remaining layers, and then controllably interpolated at the logit level. We upgraded DeepSeek-R1-Distill-Qwen-7B from a slow-thinking model to one that supports both fast and slow thinking, allowing flexible alignment control even during inference. By encouraging deeper reasoning, it even surpassed its original performance.
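
The logit-level interpolation described in the abstract can be illustrated with a minimal sketch. The exact formula is not given in this summary, so the linear blend below is an assumption; it follows the convention of Figure 4, where $\lambda=0$ uses only the slow-thinking path and $\lambda=1$ only the fast-thinking path. The names `interpolate_logits`, `slow_logits`, and `fast_logits` are hypothetical, and the toy logit vectors are made up for illustration.

```python
import math


def interpolate_logits(slow_logits, fast_logits, lam):
    """Linearly blend two logit vectors (assumed form, not confirmed by the paper).

    lam = 0 returns the slow-path logits, lam = 1 the fast-path logits;
    values outside [0, 1] extrapolate beyond either path.
    """
    if len(slow_logits) != len(fast_logits):
        raise ValueError("logit vectors must have the same vocabulary size")
    return [(1.0 - lam) * s + lam * f for s, f in zip(slow_logits, fast_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy next-token logits over a three-token vocabulary (illustrative values only).
slow = [2.0, 0.5, -1.0]   # path through the original bottom layer
fast = [0.0, 1.5, 0.5]    # path through the layer adapter

for lam in (0.0, 0.5, 1.0, 1.5):
    mixed = interpolate_logits(slow, fast, lam)
    print(lam, [round(p, 3) for p in softmax(mixed)])
```

Because the blend happens only at the logit level, a single scalar can be changed per request without retraining; whether the paper applies it per token or per sequence is not stated here.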

Paper Structure

This paper contains 62 sections, 2 theorems, 30 equations, 8 figures, 14 tables.

Key Result

Proposition 1

It can be equivalently written as

Figures (8)

  • Figure 1: Our InRa: The inputs are fed simultaneously into the layer adapter and the original bottom layer of the LM. The hidden states from both paths are propagated through all layers and merged at the logit level. The layer adapter enables flexible realignment even during inference.
  • Figure 2: Overview of the attention and MLP components. Identity copy sets the weight and bias of the last projector in each component to zero.
  • Figure 3: (a) All layers fine-tuning. (b) Fine-tuning on the added identity layer while keeping the original layers of the LM frozen.
  • Figure 4: Reasoning performance on different models and benchmarks with our InRa, verifying the successful interpolation and extrapolation of realignment. $\lambda=0$ means using only slow thinking, while $\lambda=1$ means using only fast thinking.
  • Figure 5: Comparison of two loss curves.
  • ...and 3 more figures
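
The identity copy described in the Figure 2 caption can be checked with a small sketch. It assumes a standard residual block of the form $y = x + W_2\,\phi(W_1 x + b_1) + b_2$, which this summary does not spell out; with the last projector's weight and bias set to zero, the block returns its input unchanged. The helper names and the ReLU activation are stand-ins chosen for illustration, not the paper's implementation.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def residual_mlp(x, W1, b1, W2, b2):
    """Residual MLP block: x + W2 * relu(W1 x + b1) + b2 (assumed structure)."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, x), b1)]
    out = [o + b for o, b in zip(matvec(W2, hidden), b2)]
    return [xi + oi for xi, oi in zip(x, out)]


random.seed(0)
d_model, d_hidden = 4, 8

# The first projector keeps arbitrary (here random) weights.
W1 = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [random.gauss(0, 1) for _ in range(d_hidden)]

# Identity copy: the last projector's weight and bias are zeroed.
W2_zero = [[0.0] * d_hidden for _ in range(d_model)]
b2_zero = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]
y = residual_mlp(x, W1, b1, W2_zero, b2_zero)
print(y == x)
```

Starting from an identity transformation means the adapter path initially reproduces the original model's behavior, so fine-tuning moves it away from that baseline gradually.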

Theorems & Definitions (2)

  • Proposition 1
  • Theorem 1