TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Weibin Liao, Xu Chu, Yasha Wang

TL;DR

This work identifies the limitations of binary Direct Preference Optimization (DPO) for tree-structured preferences and proposes Tree Preference Optimization (TPO), which learns directly from full preference trees. TPO frames LM alignment as a Preference List Ranking problem and introduces an Adaptive Step Reward to create fine-grained, step-level distinctions, addressing shared sub-trajectories in multi-step/multi-branch reasoning. In extensive experiments on mathematical, coding, and reasoning tasks across multiple LLM backbones, TPO consistently outperforms DPO and SFT baselines, in some cases surpassing larger models. The paper also analyzes the impact of reward values and list size, and discusses limitations such as potential catastrophic forgetting and data-imputation challenges, outlining future directions like improved data generation and memory-aware training. Overall, TPO advances alignment for complex, tree-structured reasoning by integrating Learn-to-Rank principles with adaptive, semantically aware reward modulation.

Abstract

In complex reasoning tasks such as mathematical reasoning, recent work has proposed using Direct Preference Optimization (DPO) to suppress the output of dispreferred responses, thereby enhancing the long-chain reasoning capabilities of large language models (LLMs). To this end, these studies employ LLMs to generate preference trees via Tree-of-Thoughts (ToT) and sample the paired preference responses required by the DPO algorithm. However, because DPO is based on binary preference optimization, it cannot learn from the multiple responses with varying degrees of preference/dispreference that preference trees provide, resulting in incomplete preference learning. In this work, we introduce Tree Preference Optimization (TPO), which does not sample paired preference responses from the preference tree; instead, it learns directly from the entire preference tree during fine-tuning. Specifically, TPO formulates language model alignment as a Preference List Ranking problem, where the policy can potentially learn more effectively from a ranked preference list of responses given the prompt. In addition, to help LLMs identify discriminative steps within long-chain reasoning and to increase the relative reward margin in the preference list, TPO uses an Adaptive Step Reward to adjust the reward value of each step in a trajectory, enabling fine-grained preference optimization. We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The experimental results indicate that TPO consistently outperforms DPO across five public large language models on four datasets. Our code is publicly available at https://github.com/MrBlankness/TPO.git.
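
To make the contrast between binary and list-wise preference learning concrete, the following is a minimal sketch, not the paper's objective. It assumes DPO-style implicit rewards and swaps in the standard Plackett-Luce (ListMLE) ranking loss as a stand-in for TPO's Preference List Ranking loss; it also omits the Adaptive Step Reward entirely. The function names and the `beta` default are hypothetical, chosen only for illustration.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    # beta=0.1 is an illustrative default, not a value taken from the paper.
    return beta * (logp_policy - logp_ref)

def dpo_pair_loss(r_chosen, r_rejected):
    # Binary preference loss: -log sigmoid(r_chosen - r_rejected),
    # written in a numerically stable softplus form.
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def listmle_loss(rewards_ranked):
    # Plackett-Luce negative log-likelihood of a ranked list.
    # rewards_ranked must be ordered from most to least preferred.
    # ASSUMPTION: ListMLE is used here only as a generic list-wise loss.
    loss = 0.0
    for i in range(len(rewards_ranked) - 1):
        tail = rewards_ranked[i:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(r - m) for r in tail))
        loss += log_z - rewards_ranked[i]
    return loss

# Hypothetical sequence log-probabilities for three responses to one prompt,
# ordered from most to least preferred: (policy, reference).
logps = [(-12.0, -14.0), (-15.0, -15.5), (-18.0, -16.0)]
rewards = [implicit_reward(p, q) for p, q in logps]

pair_loss = dpo_pair_loss(rewards[0], rewards[-1])  # uses only best vs. worst
list_loss = listmle_loss(rewards)                   # uses the whole ranking
```

With a list of exactly two responses the list-wise loss reduces to the pairwise DPO loss, so in this sketch the binary case is a special case of ranking; longer lists add terms for the intermediate responses that pairwise sampling would discard.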

Paper Structure

This paper contains 46 sections, 9 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: The framework of TPO: TPO regards preference modeling as a more general Preference List Ranking (PLR) problem and employs an Adaptive Step Reward for achieving finer-grained preference optimization.
  • Figure 2: (a) illustrates the data generation pipeline we used, where we start from the intermediate steps of the original correct reasoning trajectories and generate new reasoning trajectories step by step. The $\bullet$ steps represent preferred reasoning steps, the $\bullet$ steps denote dispreferred reasoning steps, and the $\bullet$ steps indicate reasoning steps with unknown preference. (b) shows how we used ChatGPT to score each reasoning trajectory, with scores in the range $\left[-100, 100\right]$. We provided ChatGPT with correct reasoning trajectories as a reference and employed ReAct to improve score credibility. (c) presents the distribution of reasoning trajectories across various score intervals.
  • Figure 3: (a) compares TPO and DPO using various reward value distributions for dispreferred responses on the ASDiv and GSM-Plus datasets. The numbers following each DPO group in the legend give the mean and standard deviation of the reward values for dispreferred responses. The results indicate that TPO consistently outperforms DPO. (b) shows the performance of TPO with different list sizes on the ASDiv and GSM-Plus datasets. TPO's performance improves monotonically as the list size increases.
  • Figure 4: Study of Reward Margins.