Iterative Foundation Model Fine-Tuning on Multiple Rewards
Pouya M. Ghari, Simone Sciabola, Ye Wang
TL;DR
This work tackles the challenge of fine-tuning foundation models when multiple, potentially conflicting rewards must be optimized. It introduces IterativeRS, an iterative multi-objective RL method that trains an expert for each objective and periodically merges the experts into a shared policy, thereby balancing objective-specific skills with cross-objective coherence. The paper provides a convergence analysis under standard convexity and smoothness assumptions and reports higher average rewards and greater stability (ICV) than MORLHF, Rewarded Soups, and a supervised baseline across small-molecule design, DNA sequence generation, and text summarization tasks. The combination of theoretical guarantees and diverse empirical results suggests that IterativeRS is a flexible framework for scalable, multi-objective foundation-model fine-tuning, with practical impact in drug discovery, genomics, and natural language tasks.
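The train-then-merge loop described in this summary can be illustrated with a toy sketch. It is not the authors' implementation: the policy is stood in for by a plain parameter vector, each reward is assumed to be a concave quadratic with a known gradient, experts are updated by plain gradient ascent instead of an RL algorithm, and merging is assumed to be a weighted parameter average. All function names and hyperparameters (`rounds`, `local_steps`, `lr`, `weights`) are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def make_quadratic_reward_grad(target: Sequence[float]) -> GradFn:
    """Gradient of the toy reward r(theta) = -||theta - target||^2 (an assumption)."""
    def grad(theta: Vector) -> Vector:
        return [-2.0 * (t - c) for t, c in zip(theta, target)]
    return grad


def train_expert(theta: Vector, grad_fn: GradFn, steps: int, lr: float) -> Vector:
    """Run a few gradient-ascent steps on one reward, starting from the shared parameters."""
    expert = list(theta)
    for _ in range(steps):
        g = grad_fn(expert)
        expert = [p + lr * gi for p, gi in zip(expert, g)]
    return expert


def merge(experts: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Merge experts by weighted parameter averaging (assumed merge rule)."""
    dim = len(experts[0])
    return [sum(w * e[j] for w, e in zip(weights, experts)) for j in range(dim)]


def iterative_merge_finetune(
    theta0: Vector,
    grad_fns: Sequence[GradFn],
    rounds: int,
    local_steps: int,
    lr: float,
    weights: Optional[Sequence[float]] = None,
) -> Vector:
    """Alternate between per-reward expert training and merging into a shared vector."""
    n = len(grad_fns)
    if weights is None:
        weights = [1.0 / n] * n  # uniform merge weights by default
    shared = list(theta0)
    for _ in range(rounds):
        # Every expert restarts from the current shared parameters.
        experts = [train_expert(shared, g, local_steps, lr) for g in grad_fns]
        shared = merge(experts, weights)
    return shared


if __name__ == "__main__":
    # Two conflicting toy rewards that pull toward different targets.
    grads = [make_quadratic_reward_grad([1.0, 0.0]),
             make_quadratic_reward_grad([0.0, 1.0])]
    theta = iterative_merge_finetune([0.0, 0.0], grads, rounds=20, local_steps=5, lr=0.1)
    print(theta)
```

In this sketch, setting `rounds=1` with a large `local_steps` reduces the loop to training each expert once and averaging at the end, which is in the spirit of weight-averaging baselines such as Rewarded Soups. Larger `rounds` with smaller `local_steps` merges more often, keeping the experts closer to the shared parameters.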
Abstract
Fine-tuning foundation models has emerged as a powerful approach for generating objects with specific desired properties. Reinforcement learning (RL) provides an effective framework for this purpose, enabling models to generate outputs that maximize a given reward function. However, in many applications such as text generation and drug discovery, it can be suboptimal to optimize using a single reward signal, as multiple evaluation criteria are often necessary. This paper proposes a novel reinforcement learning-based method for fine-tuning foundation models using multiple reward signals. By employing an iterative fine-tuning strategy across these rewards, our approach generalizes state-of-the-art RL-based methods. We further provide a theoretical analysis that offers insights into the performance of multi-reward RL fine-tuning. Experimental results across diverse domains, including text, biological sequence, and small molecule generation, demonstrate the effectiveness of the proposed algorithm compared to state-of-the-art baselines.
