Iterative Foundation Model Fine-Tuning on Multiple Rewards

Pouya M. Ghari, Simone Sciabola, Ye Wang

TL;DR

This work tackles the challenge of fine-tuning foundation models when multiple, potentially conflicting rewards must be optimized. It introduces IterativeRS, an iterative multi-objective RL method that trains an expert for each objective and periodically merges them to form a shared policy, thereby balancing objective-specific skills with cross-objective coherence. The paper provides a convergence analysis under standard convexity and smoothness assumptions and reports higher average rewards and better stability (ICV) than MORLHF, Rewarded Soups, and a supervised baseline across small-molecule design, DNA sequence generation, and text summarization tasks. The combination of theoretical guarantees and diverse empirical results suggests IterativeRS as a flexible framework for scalable, multi-objective foundation-model fine-tuning with practical impact in drug discovery, genomics, and natural language tasks.

Abstract

Fine-tuning foundation models has emerged as a powerful approach for generating objects with specific desired properties. Reinforcement learning (RL) provides an effective framework for this purpose, enabling models to generate outputs that maximize a given reward function. However, in many applications such as text generation and drug discovery, it can be suboptimal to optimize using a single reward signal, as multiple evaluation criteria are often necessary. This paper proposes a novel reinforcement learning-based method for fine-tuning foundation models using multiple reward signals. By employing an iterative fine-tuning strategy across these rewards, our approach generalizes state-of-the-art RL-based methods. We further provide a theoretical analysis that offers insights into the performance of multi-reward RL fine-tuning. Experimental results across diverse domains including text, biological sequence, and small molecule generation, demonstrate the effectiveness of the proposed algorithm compared to state-of-the-art baselines.
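
To make the iterative idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that each objective's reward is a simple quadratic centered on a hypothetical target vector, that each expert is trained by plain gradient ascent, and that merging means uniform parameter averaging. The function name `iterative_merge_sketch` and all of its parameters are illustrative choices that do not come from the paper.

```python
import numpy as np

def iterative_merge_sketch(targets, theta_ref, rounds=20, local_steps=5, lr=0.1):
    """Toy loop: train one expert per objective, then merge them, and repeat.

    Assumptions (not from the paper): reward i is -0.5 * ||theta - target_i||^2,
    experts use plain gradient ascent, and merging is a uniform average.
    """
    theta = np.asarray(theta_ref, dtype=float).copy()
    for _ in range(rounds):
        experts = []
        for target in targets:
            # Every expert starts from the current shared parameters.
            expert = theta.copy()
            for _ in range(local_steps):
                # Gradient of the toy reward with respect to the parameters.
                expert += lr * (target - expert)
            experts.append(expert)
        # Merge step: average the expert parameters into the shared policy.
        theta = np.mean(experts, axis=0)
    return theta

# Three hypothetical objectives pulling the parameters in different directions.
targets = [np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
theta_shared = iterative_merge_sketch(targets, theta_ref=np.zeros(2))
print(theta_shared)
```

In this toy setting the shared parameters drift toward the average of the three targets, which illustrates how periodic merging trades off objective-specific progress against a single compromise policy. Real foundation-model fine-tuning would replace the quadratic rewards with learned or computed reward functions and the gradient step with an RL update.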

Paper Structure

This paper contains 22 sections, 2 theorems, 53 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Let the learning rate at step $t$ be set as $\eta_t = \frac{2}{\mu(\gamma+t)}$, where $\gamma = \max\{\frac{8L}{\mu},m\}-1$. Furthermore, let ${\bm{\theta}}_{\text{ref}}$ denote the policy parameter of the initial reference policy $\pi_{\text{ref}}$. Under Assumptions A1--A3, the performance ... where $\Delta^*$ is defined as:
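
As a small numerical illustration of the step-size schedule stated in Theorem 1, the sketch below evaluates $\eta_t$ for given constants. It assumes that $\mu$ and $L$ are the strong-convexity and smoothness constants implied by the convexity and smoothness assumptions; the meaning of $m$ is not given in this excerpt, so it is treated here as a plain numeric input. The function name `step_size` is hypothetical.

```python
def step_size(t, mu, L, m):
    """Evaluate eta_t = 2 / (mu * (gamma + t)) with gamma = max(8L/mu, m) - 1.

    mu and L are assumed to be the strong-convexity and smoothness constants;
    m is passed through as given, since this excerpt does not define it.
    """
    gamma = max(8.0 * L / mu, m) - 1.0
    return 2.0 / (mu * (gamma + t))

# Example constants chosen only for illustration.
schedule = [step_size(t, mu=1.0, L=2.0, m=5) for t in range(4)]
print(schedule)
```

With these example constants, $\gamma = 15$, so the schedule starts at $2/15$ and decays on the order of $1/t$.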

Figures (3)

  • Figure 1: Pairwise scatter plots of generated molecules in the reward space for the three objectives.
  • Figure 2: Pairwise scatter plots of generated DNA sequences in the reward space for the three objectives.
  • Figure 3: Pairwise scatter plots of generated summaries in the reward space for the three objectives.

Theorems & Definitions (3)

  • Theorem 1
  • Lemma 1
  • proof