Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

Zhiwei Tang; Dmitry Rybin; Tsung-Hui Chang

TL;DR

This work addresses the optimization of a black-box objective when feedback is available only through rankings. It introduces ZO-RankSGD, a zeroth-order algorithm that builds descent directions from ranking information via a rank-based estimator and a DAG-based translation of rankings into pairwise comparisons, with a variance analysis that depends on the parameters $(m,k)$ of the ranking oracle. The authors prove convergence to a stationary point and propose a practical line-search scheme that adapts step sizes using ranking oracles. Empirical results on simple functions, RL tasks, and diffusion-model image generation demonstrate competitive performance, robustness to ranking noise, and meaningful gains from human-in-the-loop feedback. The framework advances AI alignment by enabling direct optimization with human preferences when exact objective values are inaccessible.

Abstract

In this study, we delve into an emerging optimization challenge involving a black-box objective function that can only be gauged via a ranking oracle, a situation frequently encountered in real-world scenarios, especially when the function is evaluated by human judges. This challenge is inspired by Reinforcement Learning with Human Feedback (RLHF), an approach recently employed to enhance the performance of Large Language Models (LLMs) using human guidance. We introduce ZO-RankSGD, an innovative zeroth-order optimization algorithm designed to tackle this optimization problem, accompanied by theoretical assurances. Our algorithm utilizes a novel rank-based random estimator to determine the descent direction and guarantees convergence to a stationary point. Moreover, ZO-RankSGD is readily applicable to policy optimization problems in Reinforcement Learning (RL), particularly when only ranking oracles for the episode reward are available. Last but not least, we demonstrate the effectiveness of ZO-RankSGD in a novel application: improving the quality of images generated by a diffusion generative model with human ranking feedback. Throughout the experiments, we found that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zeroth-order optimization by addressing the problem of optimizing functions with only ranking feedback, and offers a new and effective approach for aligning Artificial Intelligence (AI) with human intentions.
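To make the idea of a descent direction built from ordering information concrete, the following is a minimal sketch of the simplest case: a single pairwise comparison between two random perturbations of the current point. It is not the paper's ZO-RankSGD implementation. The objective `f`, the oracle `compare`, the smoothing radius `mu`, the step size `eta`, and the helper names are all assumptions chosen for illustration.

```python
import random

def f(x):
    # Hypothetical black-box objective (assumption for this sketch).
    # The optimizer never reads this value, only comparison outcomes.
    return sum(v * v for v in x)

def compare(a, b):
    # Comparison oracle: +1 if a is worse (higher f) than b, otherwise -1.
    return 1 if f(a) > f(b) else -1

def comparison_step(x, mu, eta, rng):
    d = len(x)
    # Two independent Gaussian perturbation directions.
    xi1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    xi2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    a = [xv + mu * u for xv, u in zip(x, xi1)]
    b = [xv + mu * u for xv, u in zip(x, xi2)]
    s = compare(a, b)
    # s * (xi1 - xi2) points toward the worse of the two perturbations,
    # so the update moves in the opposite direction.
    g = [s * (u - v) for u, v in zip(xi1, xi2)]
    return [xv - eta * gv for xv, gv in zip(x, g)]

def run(steps=2000, seed=0):
    rng = random.Random(seed)
    x = [3.0, -2.0, 1.0]
    for _ in range(steps):
        x = comparison_step(x, mu=0.01, eta=0.01, rng=rng)
    return x
```

Calling `run()` drives the iterate toward the minimizer of the toy quadratic even though no function value is ever observed. With a fixed step size the iterate only settles into a neighborhood of the minimizer, which is one motivation for adapting the step size.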

Paper Structure (21 sections, 8 theorems, 85 equations, 16 figures, 3 algorithms)

Introduction
Related works
Finding descent direction from the ranking information
A comparison-based estimator for descent direction
From ranking information to pairwise comparison
ZO-RankSGD: Zeroth-Order Rank-based Stochastic Gradient Descent
Theoretical guarantee of ZO-RankSGD
Line search via ranking oracle
Experiments
Simple functions
Reinforcement Learning with ranking oracles
Taming Diffusion Generative Model with Human Feedback
Conclusion
A simplified expression for the rank-based gradient estimator (\ref{eq:rank_grad_est})
Missing Proof
...and 6 more sections

Key Result

Lemma 1

For any $x\in\mathbb{R}^d$, we have [equation not recovered from the source], where $C_d\geq 0$ is a constant that depends only on $d$.

Figures (16)

Figure 1: Application of our proposed algorithm to enhancing the quality of images generated by Stable Diffusion with human ranking feedback. At each iteration of this human-in-the-loop optimization, we use Stable Diffusion to generate multiple images by perturbing the latent embedding with random noise; humans then rank these images by quality. The ranking information is then used to update the latent embedding.
Figure 2: The corresponding DAG for the ranking result $O_f^{(5,3)}(x_1,x_2,x_3,x_4,x_5)=(1,3,2)$.
Figure 3: Performance of different algorithms.
Figure 4: Performance of ZO-RankSGD under different combinations of $m$ and $k$.
Figure 5: Performance of ZO-RankSGD and CMA-ES on three MuJoCo environments.
...and 11 more figures
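
The ranking result in the Figure 2 caption can be unpacked into pairwise comparisons. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the $(m,k)$-ranking oracle returns the indices of the $k$ best items in order, best first, so that each ranked item is preferred to every item ranked after it and to every item left out of the top $k$. The function names `ranking_to_edges` and `edge_average`, and the averaging of perturbation differences over the edges, are hypothetical choices made for this example.

```python
def ranking_to_edges(m, ranking):
    """Return edges (i, j), 1-based, meaning item i is preferred to item j."""
    ranked = list(ranking)
    unranked = [i for i in range(1, m + 1) if i not in ranked]
    edges = []
    for pos, i in enumerate(ranked):
        # i is preferred to every later ranked item and to all unranked items.
        for j in ranked[pos + 1:] + unranked:
            edges.append((i, j))
    return edges

def edge_average(perturbations, edges):
    """Average of (xi_j - xi_i) over edges: an assumed form of a rank-based
    direction that points from preferred perturbations toward worse ones."""
    d = len(perturbations[0])
    total = [0.0] * d
    for i, j in edges:
        for c in range(d):
            total[c] += perturbations[j - 1][c] - perturbations[i - 1][c]
    return [t / len(edges) for t in total]

# The ranking from the Figure 2 caption: m = 5 candidates, top k = 3 reported.
edges = ranking_to_edges(5, (1, 3, 2))
```

For this ranking the list contains nine edges: item 1 is preferred to items 3, 2, 4, and 5; item 3 to items 2, 4, and 5; and item 2 to items 4 and 5. Items 4 and 5 are not compared with each other, because the oracle reports nothing about their relative order.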

Theorems & Definitions (18)

Definition 1: $(m,k)$-ranking oracle
Lemma 1
Remark 1
Lemma 2
Definition 2
Lemma 3
Lemma 4
Theorem 1
Corollary 1
Proof: Proof of Lemma \ref{lemma:gradient}
...and 8 more
