Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation

JoonHo Lee, Jae Oh Woo, Juree Seok, Parisa Hassanzadeh, Wooseok Jang, JuYoun Son, Sima Didari, Baruch Gutow, Heng Hao, Hankyu Moon, Wenjun Hu, Yeong-Dae Kwon, Taehee Lee, Seungjai Min

TL;DR

This work tackles the challenge of assessing instruction-following quality in language models under linguistic ambiguity by introducing the Uncertainty-aware Reward Model (URM), a Bayesian proxy that outputs both rewards and their inherent uncertainty. URM converts preference signals into probabilistic judgments and quantifies epistemic, aleatoric, and Balanced Entropy-based uncertainty to guide data curation and training. By integrating URM into uncertainty-aware training objectives (UDPO) and uncertainty-conditioned policy optimization (UCPO), the approach improves instruction-following performance across datasets and benchmarks such as MT-Bench and Vicuna-Bench, and across models ranging from 1.4B to 13B parameters. The results demonstrate that uncertainty-guided curricula and objectives can substantially enhance data efficiency and reliability, paving a path for more robust, human-aligned language models with practical impact on safety and usefulness.

Abstract

Assessing the quality of responses to instructions in language models is vital but challenging due to the complexity of human language across different contexts. This complexity often results in ambiguous or inconsistent interpretations, making accurate assessment difficult. To address this issue, we propose a novel Uncertainty-aware Reward Model (URM) that provides robust uncertainty estimation for the quality of paired responses based on Bayesian approximation. Trained with preference datasets, our uncertainty-enabled proxy not only scores rewards for responses but also evaluates their inherent uncertainty. Empirical results demonstrate significant benefits of incorporating the proposed proxy into language model training. Our method boosts the instruction-following capability of language models by refining data curation for training and improving policy optimization objectives, thereby surpassing existing methods by a large margin on benchmarks such as Vicuna-Bench and MT-Bench. These findings highlight that our proposed approach substantially advances language model training and paves a new way of harnessing uncertainty within language models.

Paper Structure (64 sections, 12 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Uncertainty distributions estimated by the proposed URM (Uncertainty-aware Reward Model) for individual rewards in the instruction tuning or SFT data (a, b) and for reward gaps in the preference data (c, d). Even for responses (or preferences) with an identical reward (or reward gap), the uncertainty is distributed across a wide range.
  • Figure 2: Our proposed proxy, Uncertainty-aware Reward Model (URM), is trained to predict response rewards in preference data. It employs Monte-Carlo dropout for Bayesian approximation to output reward distributions while estimating robust uncertainty from them.
  • Figure 3: Uncertainty quantification framework.
  • Figure 4: Uncertainties plotted against rewards or reward gaps. Lower reward gaps lead to higher uncertainty, as in (a).
  • Figure 5: A poorly constructed training curriculum can lead to significantly inferior model performance for the same preference training dataset.
  • ...and 6 more figures
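
The Figure 2 caption describes Monte-Carlo dropout as a Bayesian approximation that turns a reward model into a source of reward distributions. The following is a minimal sketch of that general idea, not the paper's implementation. The toy linear reward head, the feature vectors, the Bradley-Terry style sigmoid over the reward gap, and the function names (`dropout_reward`, `mc_dropout_preference`) are all assumptions made for illustration. The split into total, aleatoric, and epistemic terms is the standard entropy-based decomposition, and the Balanced Entropy measure mentioned in the TL;DR is not reproduced here.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def dropout_reward(features, weights, drop_prob, rng):
    """One stochastic pass of a hypothetical linear reward head with dropout left on."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += w * x / keep  # inverted-dropout rescaling
    return total


def mc_dropout_preference(feat_a, feat_b, weights, drop_prob=0.1,
                          n_samples=200, seed=0):
    """Sample preference probabilities P(a preferred over b) and decompose uncertainty.

    Each sample draws independent dropout masks for the two responses
    (a simplifying assumption) and maps the reward gap through a sigmoid.
    """
    rng = random.Random(seed)
    probs = []
    for _ in range(n_samples):
        gap = (dropout_reward(feat_a, weights, drop_prob, rng)
               - dropout_reward(feat_b, weights, drop_prob, rng))
        probs.append(sigmoid(gap))

    mean_p = sum(probs) / len(probs)
    total = binary_entropy(mean_p)                                   # predictive entropy
    aleatoric = sum(binary_entropy(p) for p in probs) / len(probs)   # expected entropy
    epistemic = total - aleatoric                                    # mutual information
    return {
        "mean_preference": mean_p,
        "total": total,
        "aleatoric": aleatoric,
        "epistemic": epistemic,
    }


if __name__ == "__main__":
    result = mc_dropout_preference(
        feat_a=[1.0, 0.5, -0.2],
        feat_b=[0.2, 0.1, 0.4],
        weights=[0.8, -0.3, 0.5],
        drop_prob=0.3,
    )
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```

In this sketch the epistemic term grows when the sampled preference probabilities disagree with one another, while the aleatoric term stays high whenever individual samples sit near 0.5. A curation step could, for example, rank preference pairs by either quantity, though how URM actually uses these scores is specified in the paper rather than here.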