Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models

Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi

TL;DR

Portable Reward Tuning (PRT) reframes fine-tuning as reward learning: it trains an explicit reward r_θ(x,y) and derives a PRT model π_θ(y|x) that can be combined with any compatible foundation model through the closed-form solution of KL-regularized reward maximization. The tuning can therefore be reused as pretrained models evolve, with less inference overhead than inference-time tuning methods, which must run several models at inference. The paper gives theoretical backing in the form of a KL-regularized reward maximization foundation and a PAC-Bayesian perspective, with optional Entropy Maximization for better generalization. Empirically, PRT achieves accuracy comparable to EFT on vision and language tasks while lowering inference cost and showing favorable memory and speed characteristics; qualitative analyses illustrate how the reward shapes token-level outputs and reasoning paths. Overall, PRT offers a scalable, architecture-agnostic approach to reusable fine-tuning across changing pretrained models, which matters in practice for deploying up-to-date models efficiently.

Abstract

While foundation models have been exploited for various expert tasks through fine-tuning, any foundation model will become outdated due to its old knowledge or limited capability. The underlying foundation model must therefore eventually be replaced by a new one, and each replacement incurs the cost of fine-tuning again. Existing work addresses this problem by inference-time tuning, i.e., modifying the output probabilities of the new foundation model with the outputs of the old foundation model and its fine-tuned model, which adds inference overhead from the latter two models. In this paper, we propose a new fine-tuning principle, Portable Reward Tuning (PRT), that reduces the inference overhead by its nature, based on a reformulation of fine-tuning as reward maximization. Specifically, instead of fine-tuning the parameters of the foundation model, PRT trains the reward model explicitly through the same loss function as in fine-tuning. During inference, the reward model can be used with any foundation model (sharing the same vocabulary or label set) through the formulation of reward maximization. Experimental results, covering both vision and language models, demonstrate that the PRT-trained model can achieve accuracy comparable to existing inference-time tuning, at lower inference cost.
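
To make the inference-cost contrast in the abstract concrete, here is a minimal Python sketch on a toy three-label problem. It assumes the standard closed-form solution of KL-regularized reward maximization, π(y|x) ∝ π_pt(y|x) · exp(r(x,y)/λ), with λ = 1 by default; the function names (`prt_predict`, `eft_predict`) and all numbers are hypothetical illustrations, not the authors' implementation, and the paper's exact parametrization may differ.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def prt_predict(pretrained_logprobs, rewards, lam=1.0):
    """Assumed closed form: pi(y|x) proportional to pi_pt(y|x) * exp(r(x,y) / lam).
    Needs two inputs at inference: one pretrained model and the reward model."""
    combined = [lp + r / lam for lp, r in zip(pretrained_logprobs, rewards)]
    return [math.exp(v) for v in log_softmax(combined)]

def eft_predict(new_pt_logprobs, old_ft_logprobs, old_pt_logprobs):
    """Inference-time tuning as described in the abstract: adjust the new model
    with the old fine-tuned and old pretrained outputs (three models in total)."""
    combined = [n + f - o for n, f, o in
                zip(new_pt_logprobs, old_ft_logprobs, old_pt_logprobs)]
    return [math.exp(v) for v in log_softmax(combined)]

# Toy values (hypothetical): rewards from a trained reward model, two pretrained models.
rewards = [0.0, 2.0, -1.0]
old_pt = log_softmax([1.0, 0.5, 0.0])
new_pt = log_softmax([0.2, 1.0, 0.3])

# The same reward is reused with the new pretrained model, without retraining.
prt_new = prt_predict(new_pt, rewards)

# For comparison: treat the old PRT model as the "fine-tuned" model in EFT.
old_ft = [math.log(p) for p in prt_predict(old_pt, rewards)]
eft_new = eft_predict(new_pt, old_ft, old_pt)
```

In this toy setting the two predictions coincide, because the old fine-tuned model's log-ratio to the old pretrained model equals the reward up to a constant that the normalization removes. The difference is in what has to be evaluated at inference: `prt_predict` consumes the outputs of two models, `eft_predict` of three.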

Paper Structure

This paper contains 42 sections, 3 theorems, 16 equations, 41 figures, 3 tables, 2 algorithms.

Key Result

Proposition 3.1

There is a one-to-one correspondence between fine-tuned models and rewards that preserves their accuracy. Under this correspondence, $\pi_{\mathrm{ft}}(y|x)$ is mapped to the implicit reward $\log(\pi_{\mathrm{ft}}(y|x) / \pi_{\mathrm{pt}}(y|x))$.
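
The mapping in Proposition 3.1 can be checked numerically on a toy distribution. The sketch below is an illustration only: the probabilities are made up, the helper names are hypothetical, and the inverse direction assumes the usual closed form π(y|x) ∝ π_pt(y|x) · exp(r(x,y)) with unit regularization strength.

```python
import math

def implicit_reward(ft_probs, pt_probs):
    """Map a fine-tuned model to its implicit reward log(pi_ft / pi_pt)."""
    return [math.log(f / p) for f, p in zip(ft_probs, pt_probs)]

def model_from_reward(pt_probs, rewards):
    """Map a reward back to a model: pi(y|x) proportional to pi_pt(y|x) * exp(r(x,y))."""
    weights = [p * math.exp(r) for p, r in zip(pt_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Toy distributions over three labels for a single input x (hypothetical values).
pt_probs = [0.5, 0.3, 0.2]
ft_probs = [0.1, 0.7, 0.2]

reward = implicit_reward(ft_probs, pt_probs)
recovered = model_from_reward(pt_probs, reward)

# The round trip recovers the fine-tuned model, so the predicted label is unchanged.
predicted_ft = max(range(3), key=lambda i: ft_probs[i])
predicted_recovered = max(range(3), key=lambda i: recovered[i])
```

Because the round trip returns the same distribution up to floating-point error, the predicted label, and hence the accuracy on this toy input, is preserved.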

Figures (41)

  • Figure 1: An overview of our approach, portable reward tuning (PRT), compared with the previous inference-time tuning method, emulated fine-tuning (EFT). In the training phase, we tune the reward model $r_\theta(x,y)$ instead of a given pretrained model, using the same loss and dataset, which reduces the inference cost when the reward is used with another pretrained model.
  • Figure 2: Evaluations of inference-time tuned models for vision tasks. Each subcaption refers to the source pretrained model, and the labels on the x-axis are target pretrained models. Pretrained denotes zero-shot classification by each target model as a baseline, and FT denotes the fine-tuned target model as an oracle result.
  • Figure 3: Evaluations of inference-time instruction-tuned models on the GSM8k and IFEval benchmarks. Each subcaption refers to the source pretrained model, and the labels on the x-axis are target pretrained models. Pretrained denotes zero-shot inference by each target model as a baseline, and Instruct denotes the instruct-tuned target model as an oracle result.
  • Figure 4: Inference-time tuning from Qwen2-0.5B to the Qwen2.5 models with various sizes.
  • Figure 5: Next Token Candidates Following "... John has 2R candies. James has 6"
  • ...and 36 more figures

Theorems & Definitions (6)

  • Proposition 3.1
  • proof
  • Proposition 3.2
  • proof
  • Proposition 3.3
  • proof