Small Reward Models via Backward Inference

Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov

TL;DR

We propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free approach that reformulates reward modeling as backward inference. FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail.

Abstract

Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at https://github.com/yikee/FLIP.

Paper Structure (49 sections, 20 equations, 9 figures, 18 tables)

Figures (9)

  • Figure 1: While LM judges can be misled, FLIP effectively identifies off-topic, instruction-misaligned, and factually incorrect responses via backward inference.
  • Figure 2: Graphical models of LLM-as-a-Judge and FLIP. Shaded nodes denote observed variables and unshaded nodes denote prediction targets. FLIP samples an inferred instruction $x'$ conditioned on the response $y$, and defines the reward $r$ as the similarity between the inferred and the original instructions.
  • Figure 3: Overview of FLIP. Given a response, we use an LM to infer the instruction that would most plausibly generate the response, and use the F1 score between the inferred and original instructions as the reward.
  • Figure 4: Test-time scaling results with parallel sampling. OLMo 2 results are averaged across the 1B and 7B instruct variants, while Llama 3 results are averaged across the 1B, 3B, and 8B instruct variants. See Appendix \ref{sec:bon-appendix} for results of individual models. FLIP substantially outperforms the baselines, achieving higher performance with greater stability.
  • Figure 5: RewardBench2 performance across different response lengths. We only consider instances where all four candidate responses are of the same type. FLIP is particularly effective for longer responses.
  • ...and 4 more figures
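
The reward described in the abstract and in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (see the linked repository for that). It assumes a word-level, lowercase token F1, and it takes the backward-inference model as a caller-supplied function. The names `tokenize`, `f1_score`, `flip_reward`, `best_of_n`, and `infer_instruction` are hypothetical and chosen here for illustration only.

```python
import re
from collections import Counter
from typing import Callable, Sequence


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase word-level tokens; the paper's tokenization may differ.
    return re.findall(r"\w+", text.lower())


def f1_score(inferred: str, original: str) -> float:
    """Token-overlap F1 between the inferred and the original instruction."""
    pred, gold = tokenize(inferred), tokenize(original)
    if not pred or not gold:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def flip_reward(
    instruction: str,
    response: str,
    infer_instruction: Callable[[str], str],
) -> float:
    """Reward r = similarity(x', x), where x' is inferred from the response y.

    `infer_instruction` stands in for a small LM prompted to reconstruct
    the instruction from the response (hypothetical interface).
    """
    inferred = infer_instruction(response)
    return f1_score(inferred, instruction)


def best_of_n(
    instruction: str,
    candidates: Sequence[str],
    infer_instruction: Callable[[str], str],
) -> str:
    """Parallel-sampling selection: keep the candidate with the highest reward."""
    return max(
        candidates,
        key=lambda y: flip_reward(instruction, y, infer_instruction),
    )
```

In use, `infer_instruction` would wrap a call to a small instruction-tuned model. An off-topic response tends to yield an inferred instruction that shares few tokens with the original one, so its reward is low, which matches the behavior illustrated in Figure 1.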