Rescue: Ranking LLM Responses with Partial Ordering to Improve Response Generation

Yikun Wang, Rui Zheng, Haoming Li, Qi Zhang, Tao Gui, Fei Liu

TL;DR

Rescue addresses the data-efficient customization of LLMs by ranking candidate task responses using a partial ordering rather than forcing a full consensus. It combines supervised fine-tuning with a ranking loss that compares candidate responses, yielding the objective $L_{Rescue}(\theta) = L_{SFT}(\theta) + \alpha L_{Rank}(\theta)$, where $L_{SFT}(\theta) = - \log \pi_\theta(y^*|x)$ and $L_{Rank}$ enforces a margin between top and competing responses. By evaluating on textual entailment (e-SNLI) and multi-document QA, Rescue shows that partial ordering strategies (e.g., Label Prioritization and Human-Label Hybrid) outperform full ordering and pure SFT, especially under data-scarce conditions. The method demonstrates robustness to noisy human judgments, reduces annotation costs, and improves both answer accuracy and explanation quality, offering a practical path for task-specific LLM customization. These results highlight the potential of partial ordering and ranking-based fine-tuning to enhance LLMs in domains with limited expert data and long-context reasoning tasks.
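
As a rough illustration of the combined objective, the sketch below computes an SFT term plus a weighted ranking term from precomputed sequence log probabilities. The exact form of $L_{Rank}$ is not given in this summary, so the hinge on log-probability differences, the `margin` parameter, the function names, and the default `alpha=0.05` (picked from the 0.01 to 0.1 range reported in Figure 4) are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def sft_loss(logp_best: float) -> float:
    """Negative log-likelihood of the preferred response y*."""
    return -logp_best


def rank_loss(logps: List[float],
              pairs: List[Tuple[int, int]],
              margin: float = 0.0) -> float:
    """Assumed hinge-style ranking term.

    Each pair is (better_idx, worse_idx). A pair contributes only when the
    worse response scores within `margin` of (or above) the better one.
    """
    total = 0.0
    for better, worse in pairs:
        total += max(0.0, logps[worse] - logps[better] + margin)
    return total


def rescue_loss(logps: List[float],
                best_idx: int,
                pairs: List[Tuple[int, int]],
                alpha: float = 0.05,
                margin: float = 0.0) -> float:
    """L_SFT + alpha * L_Rank for a single prompt."""
    return sft_loss(logps[best_idx]) + alpha * rank_loss(logps, pairs, margin)


# Hypothetical log probabilities for three candidate responses.
logps = [-1.0, -2.0, -0.5]
# Response 0 is preferred over responses 1 and 2.
pairs = [(0, 1), (0, 2)]
loss = rescue_loss(logps, best_idx=0, pairs=pairs)
```

In this example only the pair `(0, 2)` violates the ordering, since response 2 scores higher than the preferred response 0, so it alone contributes to the ranking term. In practice the log probabilities would come from the model being fine-tuned and the loss would be computed with a differentiable tensor library.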

Abstract

Customizing LLMs for a specific task involves separating high-quality responses from lower-quality ones. This skill can be developed using supervised fine-tuning with extensive human preference data. However, obtaining a large volume of expert-annotated data is costly for most tasks. In this paper, we explore a novel method to optimize LLMs using ranking metrics. This method trains the model to prioritize the best responses from a pool of candidates created for a particular task. Rather than a traditional full ordering, we advocate for a partial ordering, as achieving consensus on the perfect order of candidate responses can be challenging. Our partial ordering is more robust, less sensitive to noise, and can be achieved with limited human annotations or through heuristic methods. We test our system's improved response generation ability using benchmark datasets, including textual entailment and multi-document question answering. We conduct ablation studies to understand crucial factors, such as how to gather candidate responses for a specific task, determine their most suitable order, and balance supervised fine-tuning with ranking metrics. Our approach, named Rescue, offers a promising avenue for enhancing the response generation and task accuracy of LLMs.

Paper Structure (22 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: When LLMs provide accurate label predictions, they are frequently accompanied by high-quality explanations [liu2023prudent]. Building on this insight, we rank candidate explanations obtained from diverse sources into a partial order. Human responses are placed above model responses with correct labels, and these are prioritized over incorrect responses. In scenarios with limited human annotations, we use this hierarchy to teach the LLM to generate high-quality explanations, which in turn leads to more accurate label predictions.
  • Figure 2: For the Multi-doc QA task, we anchor responses in different parts of the context to produce a diverse set of answers. We generate five candidate responses per instance, one from the gold passage and four from random distractors.
  • Figure 3: Human evaluation results. Our partial ordering (PO) with label prioritization outperforms the SFT model with an overall win rate of 47%. While SFT shows comparable accuracy in automatic evaluation, it often relies on data artifacts for predictions [DBLP:journals/corr/abs-1803-02324] and does not yield better explanations. Our PO method also outperforms other methods such as FO Similarity and the base Llama-2-7b model.
  • Figure 4: (Left) The influence of different $\alpha$ values on task accuracy. We find that optimal performance is achieved with an $\alpha$ value between 0.01 and 0.1. (Right) We conduct experiments with a varying number of candidate responses per prompt. Results indicate that performance improvement can be achieved even with 3-4 candidate responses.
  • Figure 5: The left panel shows the log probabilities of human responses, while the middle and right panels show those of responses from Llama-2-7b and GPT-3.5-turbo-0613, respectively. We assign a length scaling factor, $\lambda$, of 0.85 to all model responses, maintaining a $\lambda$ of 1.0 for human responses. This approach effectively shifts the log probability score distributions of model responses (colored in red) closer to those of human ones, thereby minimizing margin violations.
  • ...and 1 more figure
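
The hierarchy in the Figure 1 caption and the length scaling in the Figure 5 caption can be sketched together. The snippet below is a minimal sketch under stated assumptions: the tier names, the helper `partial_order_pairs`, and the normalization `sum(log p) / len ** lambda` are illustrative choices, since this summary does not spell out how pairs are built or how $\lambda$ enters the score.

```python
from itertools import combinations
from typing import List, Tuple

# Assumed tiers, following the Figure 1 caption: human responses rank above
# model responses with correct labels, which rank above incorrect responses.
TIER = {"human": 0, "model_correct": 1, "model_incorrect": 2}


def partial_order_pairs(sources: List[str]) -> List[Tuple[int, int]]:
    """Return (better_idx, worse_idx) pairs implied by the tiers.

    Responses in the same tier are left unordered, which is what makes the
    ordering partial rather than full.
    """
    pairs = []
    for i, j in combinations(range(len(sources)), 2):
        ti, tj = TIER[sources[i]], TIER[sources[j]]
        if ti < tj:
            pairs.append((i, j))
        elif tj < ti:
            pairs.append((j, i))
    return pairs


def scaled_logprob(token_logps: List[float], lam: float) -> float:
    """Length-scaled sequence score (assumed form): sum of token log probs
    divided by the token count raised to the power `lam`."""
    if not token_logps:
        raise ValueError("token_logps must not be empty")
    return sum(token_logps) / (len(token_logps) ** lam)


sources = ["human", "model_correct", "model_correct", "model_incorrect"]
pairs = partial_order_pairs(sources)

# Hypothetical per-token log probabilities for a four-token response.
tokens = [-1.0, -1.0, -1.0, -1.0]
human_score = scaled_logprob(tokens, lam=1.0)   # lambda for human responses
model_score = scaled_logprob(tokens, lam=0.85)  # lambda for model responses
```

Under this assumed normalization, a $\lambda$ below 1 divides by a smaller number and so lowers the score of a model response relative to an identical human response, which agrees with the caption's description of shifting model score distributions toward the human ones. The two `model_correct` responses produce no pair between them, so no ranking constraint is imposed within a tier.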