SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

Dengcan Liu, Jiahao Li, Zheren Fu, Yi Tu, Jiajun Li, Zhendong Mao, Yongdong Zhang

TL;DR

SparseRM addresses the resource-intensive nature of reward modeling for LLM alignment by extracting preference-relevant features from intermediate representations with a Sparse Autoencoder. It constructs projection vectors along interpretable directions and trains a lightweight single-layer reward head, enabling robust preference scoring without fine-tuning the backbone. Across truthfulness, safety, and adversarial evaluation, SparseRM achieves competitive or superior performance while using less than 1% of trainable parameters and integrating smoothly into online iterative alignment loops. The approach also offers interpretability by linking latent directions to semantic preference cues and demonstrates stronger generalization under distributional shifts compared to dense representations.

Abstract

Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs an SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
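
The pipeline described in the abstract (project representations onto SAE directions, then aggregate the resulting alignment scores with a small reward head) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the unit-normalization of directions, the pairwise logistic (Bradley-Terry style) loss, and the synthetic data are all assumptions made for illustration; the summary does not specify how the directions are selected or which loss is used.

```python
import numpy as np


def alignment_scores(hidden, directions):
    """Project representations onto unit-normalized directions.

    hidden:     (n, d) array, one LLM representation per response
    directions: (K, d) array of selected SAE directions (hypothetical input)
    returns:    (n, K) array of alignment scores
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return hidden @ unit.T


def reward(scores, w):
    """Single linear layer that aggregates K alignment scores into a scalar."""
    return scores @ w


def train_reward_head(s_chosen, s_rejected, lr=0.1, steps=300):
    """Fit w by gradient descent on a pairwise logistic loss (an assumption)."""
    w = np.zeros(s_chosen.shape[1])
    diff = s_chosen - s_rejected  # only score differences matter for pairs
    for _ in range(steps):
        margin = diff @ w
        sig = 1.0 / (1.0 + np.exp(-margin))
        # Gradient of mean(-log(sigmoid(margin))) with respect to w
        grad = -((1.0 - sig)[:, None] * diff).mean(axis=0)
        w -= lr * grad
    return w


def demo(seed=0, n=200, d=16, k=4):
    """Synthetic check: the preferred responses lean toward direction 0."""
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(k, d))
    chosen = 0.5 * rng.normal(size=(n, d)) + directions[0]
    rejected = 0.5 * rng.normal(size=(n, d)) - directions[0]
    s_c = alignment_scores(chosen, directions)
    s_r = alignment_scores(rejected, directions)
    w = train_reward_head(s_c, s_r)
    # Fraction of pairs where the chosen response gets the higher reward
    return float(np.mean(reward(s_c, w) > reward(s_r, w)))


if __name__ == "__main__":
    print("pairwise accuracy:", demo())
```

In this sketch the only trainable parameters are the K weights in `w`; the backbone representations and the directions stay fixed, which mirrors the small trainable-parameter budget the abstract reports.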

Paper Structure

This paper contains 22 sections, 10 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Comparison of traditional RM (Reward Model) and our proposed SparseRM. The SparseRM leverages the sparse autoencoder to extract interpretable preference features and then trains a lightweight reward head with significantly fewer parameters than traditional reward models.
  • Figure 2: The overview of our proposed work. We first construct the SparseRM with a sparse autoencoder and then integrate it into the online iterative alignment framework. (a) SparseRM identifies preference-aware subspaces and trains a reward model using projection vectors. (b) Generated responses are filtered by SparseRM to improve alignment through iterative DPO training.
  • Figure 3: Performance comparison of different RMs across various datasets: using Gemma-2-9B-it as the backbone, SparseRM achieves the highest accuracy on TruthfulQA and outperforms most baselines on SafeRLHF and Red-Teaming, while using the fewest trainable parameters.
  • Figure 4: Comparison of SparseRM performance under different transformer layers and selected SAE latents $K$.
  • Figure 5: Cosine Similarity between Generated and Training Data in Sparse and Dense Spaces.
  • ...and 6 more figures
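
The filtering step in Figure 2(b) can be sketched as follows. The caption states only that generated responses are filtered by SparseRM before iterative DPO training, so the best-versus-worst selection rule, the `min_gap` threshold, and the names `build_preference_pairs` and `score` are hypothetical choices for illustration, not the paper's procedure.

```python
from typing import Callable, Dict, List, Tuple


def build_preference_pairs(
    samples: Dict[str, List[str]],
    score: Callable[[str, str], float],
    min_gap: float = 0.0,
) -> List[Tuple[str, str, str]]:
    """Turn scored responses into (prompt, chosen, rejected) triples.

    samples: prompt -> list of generated responses
    score:   reward function (prompt, response) -> float, e.g. a reward model
    min_gap: keep a pair only if the reward gap exceeds this value (assumption)
    """
    pairs = []
    for prompt, responses in samples.items():
        if len(responses) < 2:
            continue  # a preference pair needs at least two candidates
        ranked = sorted(responses, key=lambda r: score(prompt, r))
        worst, best = ranked[0], ranked[-1]
        if score(prompt, best) - score(prompt, worst) > min_gap:
            pairs.append((prompt, best, worst))
    return pairs
```

Each returned triple follows the prompt, chosen, rejected layout commonly used for DPO training data; in an iterative loop, the pairs from one round would train the policy that generates the next round's responses.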