Unsupervised Large Language Model Alignment for Information Retrieval via Contrastive Feedback

Qian Dong; Yiding Liu; Qingyao Ai; Zhijing Wu; Haitao Li; Yiqun Liu; Shuaiqiang Wang; Dawei Yin; Shaoping Ma

TL;DR

Off-the-shelf large language models often produce generic outputs that fail to distinguish between highly similar documents, which hurts information retrieval. The paper proposes Reinforcement Learning from Contrastive Feedback (RLCF), an unsupervised, group-wise alignment framework that forms groups of similar documents, generates a response for each document, and optimizes the model with a group-wise reciprocal rank (GRR) reward using Proximal Policy Optimization with a KL penalty. By contrasting responses within document groups, RLCF yields more distinctive and informative outputs for document summarization, document expansion, and data augmentation in dense retrieval, across English and Chinese LLMs at multiple parameter scales. It outperforms existing alignment methods and approaches the performance of larger models such as GPT-3.5/4 on several tasks. The gains in informativeness and discriminability require no supervision, and the approach could potentially extend to additional domains.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various research domains, including Information Retrieval (IR). However, the responses generated by off-the-shelf LLMs tend to be generic, i.e., they fail to capture the distinctiveness of each document among documents with similar content. This limits the performance of LLMs in IR, because finding and distinguishing relevant documents among a substantial number of similar documents is a typical problem in many IR tasks. To address this issue, we propose an unsupervised alignment method, namely Reinforcement Learning from Contrastive Feedback (RLCF), empowering LLMs to generate both high-quality and context-specific responses. Our approach constructs unsupervised contrastive feedback signals based on groups of similar documents, and adopts a reward function, named group-wise reciprocal rank, to optimize LLMs with standard Proximal Policy Optimization. We conduct extensive experiments to evaluate the effectiveness of RLCF on LLMs built with different languages and parameter sizes on multiple downstream IR applications. RLCF significantly outperforms existing alignment methods, and RLCF-optimized LLMs show considerable improvement in generating distinctive responses.
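The group-wise reciprocal rank reward can be illustrated with a small sketch. It assumes, following the Figure 3 caption, that a response is scored against every document in its group by the inner product of their embeddings, and that the reward is the reciprocal of the rank of the response's own source document. The function names, the toy two-dimensional embeddings, and the tie-breaking rule are hypothetical; the exact formulation in the paper may differ.

```python
from typing import List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def group_reciprocal_rank(
    response_emb: Sequence[float],
    group_doc_embs: List[Sequence[float]],
    source_index: int,
) -> float:
    """Reward = 1 / rank of the source document among its group.

    Assumption: documents are ranked by inner product with the response
    embedding, and ties are resolved in favor of the source document.
    """
    scores = [inner_product(response_emb, d) for d in group_doc_embs]
    source_score = scores[source_index]
    # Rank is 1 plus the number of other documents that score strictly higher.
    rank = 1 + sum(
        1 for i, s in enumerate(scores) if i != source_index and s > source_score
    )
    return 1.0 / rank


# Toy group of three similar documents (hypothetical embeddings).
GROUP = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# A response that matches document 0 best ranks it first.
distinctive_reward = group_reciprocal_rank([1.0, -0.2], GROUP, source_index=0)

# A generic response that matches the other documents better ranks it last.
generic_reward = group_reciprocal_rank([0.6, 0.8], GROUP, source_index=0)
```

Under these assumptions the distinctive response receives the maximum reward of 1, while the generic one receives 1/3, which is the kind of contrastive signal the abstract describes being fed back to the LLM during PPO.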

Paper Structure (22 sections, 8 equations, 10 figures, 4 tables)

This paper contains 22 sections, 8 equations, 10 figures, 4 tables.

Introduction
Related Work
Large Language Models
Alignment for LLM
LLM Applications in IR
Reinforcement Learning from Contrastive Feedback
Motivation
Data Construction
Model Optimization
Experimental Setup
LLM Applications in IR
Datasets
Implementation Details
Evaluation
Experimental Results
...and 7 more sections

Figures (10)

Figure 1: Illustration of an LLM application in document summarization for similar documents.
Figure 2: Comparison between existing methods and RLCF. The dotted line indicates that the reward score is returned to the LLM for PPO optimization.
Figure 3: The framework of RLCF. We take the response $o_d$ as an example to illustrate the calculation of group-wise contrastive feedback. The green and blue rectangles represent the embeddings of the response and the documents, respectively. The $\otimes$ symbol represents the inner product between the embeddings $E_{o_d}$ and $E_{\mathbb{G}}$.
Figure 4: The templates used in our RLCF framework.
Figure 5: Illustrations of document expansion.
...and 5 more figures
