Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

Haihui Pan, Yuzhong Hong, Shaoke Lv, Junwei Bao, Hongfei Jiang, Yang Song

TL;DR

This work theoretically demonstrates that the alignment task can be decomposed into two distributions, quality and diversity, and proposes Quality-constrained Entropy Maximization Policy Optimization (QEMPO), which maximizes the output entropy of the policy while ensuring output quality.

Abstract

Recent research indicates that while alignment methods significantly improve the quality of large language model (LLM) outputs, they simultaneously reduce the diversity of those outputs. Although some methods have been proposed to enhance LLM output diversity, they often come at the cost of reduced performance. In this work, we first theoretically demonstrate that the alignment task can be decomposed into two distributions: quality and diversity. To enhance the diversity of LLM outputs while ensuring quality, we propose Quality-constrained Entropy Maximization Policy Optimization (QEMPO). QEMPO aims to maximize the output entropy of the policy while ensuring output quality. By adding different constraints to QEMPO, we obtain different policies. To optimize these policies, we propose both online and offline training methods. Experiments validate that QEMPO achieves performance comparable to or even better than RLHF while improving output diversity.

Paper Structure (20 sections, 14 theorems, 62 equations, 3 figures, 3 tables)


Key Result

Proposition 1

For the optimization problem: where $\mathrm{R} = \mathbb{E}_{\pi_{\mathrm{RLHF}}}[r(x,y)]$. The analytical solution that minimizes this optimization objective is $\pi(y|x) = \frac{\pi_{\mathrm{ref}}(y|x) \exp(\lambda r(x,y))}{Z(x)}$, where $Z(x) = \sum_{y}\pi_{\mathrm{ref}}(y|x) \exp(\lambda r(x,y))$ and $\lambda$ is the ...
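To make the closed form in Proposition 1 concrete, the following is a minimal sketch that evaluates $\pi(y|x) \propto \pi_{\mathrm{ref}}(y|x)\exp(\lambda r(x,y))$ on a small, finite set of candidate outputs for a single prompt. The function names (`tilted_policy`, `entropy`, `expected_reward`), the toy reference probabilities, and the toy rewards are illustrative assumptions and do not come from the paper. The sketch also treats $\lambda$ as a given input rather than deriving it from the constraint, since the constrained problem itself is not reproduced here.

```python
import math


def tilted_policy(ref_probs, rewards, lam):
    """Return pi(y|x) = pi_ref(y|x) * exp(lam * r(x, y)) / Z(x) over a finite output set.

    ref_probs, rewards: same-length lists for one prompt x (toy inputs, assumed).
    lam: the multiplier lambda, supplied by the caller in this sketch.
    """
    # Subtracting the maximum exponent leaves the normalized result unchanged
    # but avoids overflow when lam * r is large.
    shift = max(lam * r for r in rewards)
    weights = [p * math.exp(lam * r - shift) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # plays the role of Z(x), up to the constant shift
    return [w / z for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outputs."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_reward(probs, rewards):
    """Expected reward E_pi[r(x, y)] under the given distribution."""
    return sum(p * r for p, r in zip(probs, rewards))


if __name__ == "__main__":
    ref = [0.25, 0.25, 0.25, 0.25]   # hypothetical reference policy over 4 outputs
    rewards = [1.0, 1.0, 0.0, 0.0]   # hypothetical rewards: two good, two bad outputs
    for lam in (0.0, 1.0, 5.0):
        pi = tilted_policy(ref, rewards, lam)
        print(f"lambda={lam}: E[r]={expected_reward(pi, rewards):.3f}, "
              f"H={entropy(pi):.3f}")
```

In this toy setting, increasing $\lambda$ shifts probability mass toward the high-reward outputs, which raises the expected reward and lowers the entropy relative to the uniform reference. With $\lambda = 0$ the policy reduces to $\pi_{\mathrm{ref}}$.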

Figures (3)

  • Figure 1: A schematic diagram of the output space distributions of three policies with varying quality and diversity. Here, $(y^{+}_{1}, \cdots, y^{+}_{n})$ represents the set of outputs that align with our preferences, while $(y^{-}_{1}, \cdots, y^{-}_{m})$ denotes the set of outputs that do not align with our preferences. (a) represents a policy whose outputs exhibit high diversity and very high quality; (b) represents a policy whose outputs exhibit high diversity but lower quality; (c) represents a policy whose outputs have limited diversity but possess high quality.
  • Figure 2: Diversity and quality across different models and methods. Distinct colors represent different base models, while varying shapes denote different approaches.
  • Figure 3: Pass@k results of RLHF, QEMPO, and QEMPO-KL on different datasets.

Theorems & Definitions (22)

  • Proposition 1
  • Proposition 2
  • Corollary 3.0
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • Proposition 6
  • proof
  • Proposition 6
  • ...and 12 more