STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models

Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik

TL;DR

STARS addresses the alignment problem by introducing a segment-level, reward-guided rejection sampling mechanism into decoding. It defines a target Gibbs distribution that biases the base LM toward higher-reward segments and uses fixed-size token blocks to prune paths efficiently during generation. Across six LLMs and two alignment axes, STARS often outperforms fine-tuning methods like SFT and DPO and remains competitive with strong Best-of-N baselines, while also improving adversarial robustness. The approach offers a training-free, computationally efficient alternative to full-model fine-tuning for safer and more useful outputs in high-stakes applications.
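For readers unfamiliar with the term, a Gibbs target of the kind mentioned above is commonly written as a reward-tilted version of the base model. The form below is the standard textbook one, not a formula quoted from the paper; the symbols $\pi_{\text{base}}$ (base LM), $r$ (reward), $\beta$ (temperature), and $Z$ (normalizer) are assumed notation.

```latex
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{base}}(y \mid x)\,
\exp\!\left(\frac{r(x, y)}{\beta}\right),
\qquad
Z(x) \;=\; \sum_{y'} \pi_{\text{base}}(y' \mid x)\,
\exp\!\left(\frac{r(x, y')}{\beta}\right)
```

Under this form, a smaller $\beta$ concentrates probability on high-reward continuations, while a large $\beta$ leaves the base model nearly unchanged.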

Abstract

Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
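The sample, score, accept/reject loop described in the abstract can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration: `toy_sample_segment` stands in for the base LM, `toy_reward` stands in for the reward model, and the acceptance rule `exp((r - reward_max) / beta)` is the textbook rejection-sampling step for a reward-tilted target, which may differ from the rule the paper actually uses. The `max_tries` fallback (keep the best candidate seen) is likewise a hypothetical way to bound computation.

```python
import math
import random
from typing import Callable, List, Optional

Tokens = List[str]

VOCAB = ["helpful", "neutral", "harmful"]  # toy vocabulary (assumption)


def toy_sample_segment(prefix: Tokens, segment_len: int, rng: random.Random) -> Tokens:
    """Stand-in for the base LM: draws a fixed-size segment uniformly at random."""
    return [rng.choice(VOCAB) for _ in range(segment_len)]


def toy_reward(tokens: Tokens) -> float:
    """Stand-in for the reward model: fraction of non-harmful tokens, in [0, 1]."""
    return sum(t != "harmful" for t in tokens) / len(tokens)


def segment_rejection_decode(
    prompt: Tokens,
    sample_segment: Callable[[Tokens, int, random.Random], Tokens],
    reward: Callable[[Tokens], float],
    *,
    segment_len: int = 4,
    num_segments: int = 3,
    beta: float = 0.5,
    reward_max: float = 1.0,  # assumed upper bound on the reward
    max_tries: int = 50,      # must be >= 1
    rng: Optional[random.Random] = None,
) -> Tokens:
    """Grow the output one fixed-size segment at a time, rejecting low-reward segments."""
    if rng is None:
        rng = random.Random()
    output = list(prompt)
    for _ in range(num_segments):
        best_seg, best_r = None, -math.inf
        for _ in range(max_tries):
            seg = sample_segment(output, segment_len, rng)
            r = reward(output + seg)  # score the prefix extended by the candidate
            if r > best_r:
                best_seg, best_r = seg, r
            # Accept with probability exp((r - reward_max) / beta), which is <= 1.
            if rng.random() < math.exp((r - reward_max) / beta):
                chosen = seg
                break
        else:
            # No candidate was accepted: fall back to the best one seen.
            chosen = best_seg
        output.extend(chosen)
    return output


if __name__ == "__main__":
    demo = segment_rejection_decode(
        ["helpful"], toy_sample_segment, toy_reward, rng=random.Random(0)
    )
    print(demo, toy_reward(demo))
```

Because each segment is judged as soon as it is drafted, a poor continuation costs only `segment_len` tokens before it is discarded, rather than a full response as in Best-of-N ranking.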

Paper Structure

This paper contains 32 sections, 4 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: An illustration of the STARS methodology. For a given prompt, multiple candidate segments are sampled from the base model. A Process Reward Model (PRM) scores each segment for alignment, and we use rejection sampling to reject/accept segments. This iterative process allows for the early pruning of undesirable paths (e.g., harmful suggestions or generic refusals) and effectively steers the model toward a helpful and safe trajectory.
  • Figure 2: Win-rate comparison of fine-tuning (SFT for IMDB, DPO for HH-RLHF) and STARS against vanilla decoding. STARS demonstrates performance that is highly competitive with, and often superior to, traditional fine-tuning methods.