Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization

Subhojyoti Mukherjee; Viet Dac Lai; Raghavendra Addanki; Ryan Rossi; Seunghyun Yoon; Trung Bui; Anup Rao; Jayakumar Subramanian; Branislav Kveton

TL;DR

This work presents a practical offline RL framework for large language models by recasting RL as reward-weighted fine-tuning. It introduces two algorithms, Reward-Weighted Fine-Tuning (Refit) and Standardized Reward-Weighted Fine-Tuning (Swift). Both optimize an offline bound on the online RL objective using standard SFT-style updates, which avoids token-level propensity ratios. The methods are evaluated on multi-turn QA tasks across diverse datasets, where direct reward optimization yields major gains in both task rewards and language quality over SFT- and DPO-based baselines. Standardizing rewards in Swift reduces gradient variance and improves robustness, while Refit has a straightforward interpretation as weighted SFT. The results suggest that direct reward optimization with offline RL can substantially improve conversation policies, with two caveats: higher computational demands and dependence on the quality of the logged data.
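To make the weighted-SFT reading concrete, here is a minimal sketch of a reward-weighted log-likelihood loss in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of per-token log-probabilities as input, and the choice to standardize rewards across the whole batch (rather than, say, per context) are all hypothetical.

```python
from statistics import mean, pstdev


def standardize_rewards(rewards, eps=1e-8):
    """Z-score the rewards across the batch (assumed scheme, for illustration)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reward_weighted_loss(batch_token_log_probs, rewards, standardize=False):
    """Weighted-SFT loss: -(1/n) * sum_i w_i * log pi(response_i).

    batch_token_log_probs: one list of token log-probs per logged response.
    rewards: one scalar reward per logged response.
    standardize=False uses raw rewards as weights (Refit-style weighting);
    standardize=True uses z-scored rewards (Swift-style weighting).
    """
    if len(batch_token_log_probs) != len(rewards):
        raise ValueError("need exactly one reward per response")
    weights = standardize_rewards(rewards) if standardize else list(rewards)
    # The log-probability of a response is the sum of its token log-probs.
    seq_log_probs = [sum(tokens) for tokens in batch_token_log_probs]
    total = sum(w * lp for w, lp in zip(weights, seq_log_probs))
    return -total / len(rewards)


# Hypothetical logged batch: two responses with two tokens each.
log_probs = [[-0.1, -0.2], [-1.0, -0.5]]
rewards = [1.0, 0.0]
refit_style = reward_weighted_loss(log_probs, rewards)
swift_style = reward_weighted_loss(log_probs, rewards, standardize=True)
```

With raw rewards, a zero-reward response contributes nothing to the loss. With standardized rewards, below-average responses receive negative weights, so minimizing the loss lowers their likelihood. In practice the token log-probs would come from the model being fine-tuned, and the loss would be minimized by gradient descent in the same way as an SFT loss.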

Abstract

Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using techniques similar to those used in supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in stark contrast to state-of-the-art methods in this domain, which are based on SFT and direct preference optimization, have additional hyper-parameters, and do not directly optimize for rewards. We compare to them empirically and report major gains in both optimized rewards and language quality.
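One standard way to see how offline reward optimization can reduce to weighted fine-tuning is the following bound. It is a sketch under assumptions and may differ from the paper's own derivation. The symbols are introduced here for illustration only: $\pi_0$ is the logging policy, $\pi_\theta$ the fine-tuned policy, $\tau$ a logged conversation, and $r(\tau) \geq 0$ its reward. The bound uses the elementary inequality $u \geq 1 + \log u$ for $u > 0$.

```latex
\begin{align*}
J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ r(\tau) \right]
   = \mathbb{E}_{\tau \sim \pi_0}\left[ \frac{\pi_\theta(\tau)}{\pi_0(\tau)} \, r(\tau) \right] \\
  &\geq \mathbb{E}_{\tau \sim \pi_0}\left[ r(\tau) \left( 1 + \log \pi_\theta(\tau) - \log \pi_0(\tau) \right) \right] \\
  &= \mathbb{E}_{\tau \sim \pi_0}\left[ r(\tau) \log \pi_\theta(\tau) \right] + C ,
\end{align*}
```

Here $C$ does not depend on $\theta$. Maximizing the last expression over logged conversations is a log-likelihood objective weighted by reward, and the propensity ratio $\pi_\theta / \pi_0$ no longer appears. The inequality relies on nonnegative rewards, so this argument does not by itself cover standardized rewards, which can be negative.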
