Parameter Efficient Reinforcement Learning from Human Feedback

Hakim Sidahmed; Samrat Phatale; Alex Hutcheson; Zhuonan Lin; Zhang Chen; Zac Yu; Jarvis Jin; Simral Chaudhary; Roman Komarytsia; Christiane Ahlheim; Yonghao Zhu; Bowen Li; Saravanan Ganesh; Bill Byrne; Jessica Hoffmann; Hassan Mansoor; Wei Li; Abhinav Rastogi; Lucas Dixon

TL;DR

This paper tackles the high computational and memory demands of Reinforcement Learning from Human Feedback (RLHF) by evaluating Parameter-Efficient RLHF (PE-RLHF), which trains LoRA adapters for both the reward model and the policy while keeping the backbone frozen. Across benchmarks on six datasets spanning text summarization, harmless and helpful response generation, UI automation, and visual question answering, PE-RLHF achieves performance comparable to standard RLHF with substantial resource savings: up to 90% faster reward model (RM) training, up to 30% faster RL, and memory reductions of up to 50% for reward modeling and 27% for RL. Ablations over LoRA ranks and model sizes show that larger backbones benefit PE-RLHF, while rank has a limited effect on RM quality and a modest effect on RL. Overall, PE-RLHF offers a practical, scalable path to aligning large language and vision-language models with human preferences without sacrificing alignment quality.

Abstract

While Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language Models (LLMs) and Vision-Language Models (VLMs) with human preferences, its computational cost and complexity hamper its wider adoption. Parameter-efficient methods such as LoRA were introduced to alleviate some of the computational burden of fine-tuning. In this work, we empirically evaluate Parameter Efficient Reinforcement Learning from Human Feedback (PE-RLHF), a setup that uses LoRA fine-tuning for both reward modeling and reinforcement learning. We benchmark PE-RLHF on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering, measuring both the effectiveness of the trained models and the training resources required. Our findings show, for the first time, that PE-RLHF achieves performance comparable to RLHF while significantly reducing training time (up to 90% faster for reward models and 30% faster for RL) and memory footprint (up to 50% reduction for reward models and 27% for RL). We provide comprehensive ablations across LoRA ranks and model sizes for both reward modeling and reinforcement learning. By mitigating the computational burden associated with RLHF, we push for a broader adoption of PE-RLHF as an alignment technique for LLMs and VLMs.
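To make the core idea concrete, the following is a minimal sketch of a single LoRA-adapted projection: the backbone weight `W` stays frozen and only the low-rank factors `A` and `B` would be trained. It is not the paper's implementation. The helper names (`matmul`, `lora_forward`, `lora_param_counts`), the dimensions, the rank, the scale, and the initialization are illustrative assumptions; zero-initializing `B` follows common LoRA practice so that the adapted layer initially matches the frozen one.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, scale):
    """Frozen projection x @ W plus the scaled low-rank update x @ A @ B."""
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    return [[b + scale * d for b, d in zip(br, dr)] for br, dr in zip(base, delta)]

def lora_param_counts(d_in, d_out, rank):
    """Return (trainable adapter parameters, frozen backbone parameters)."""
    return rank * (d_in + d_out), d_in * d_out

random.seed(0)
d_in, d_out, rank = 8, 6, 2  # tiny, hypothetical sizes for illustration
W = [[random.gauss(0, 0.02) for _ in range(d_out)] for _ in range(d_in)]  # frozen
A = [[random.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_in)]   # trainable
B = [[0.0] * d_out for _ in range(rank)]                                  # trainable, zero-init
x = [[1.0] * d_in]

# With B at zero, the adapted layer reproduces the frozen layer exactly.
y = lora_forward(x, W, A, B, scale=1.0)

# Parameter counts for a hypothetical 1024 x 1024 projection at rank 4.
trainable, frozen = lora_param_counts(1024, 1024, rank=4)
```

For the hypothetical 1024 x 1024 projection, the adapter holds 8,192 trainable parameters against 1,048,576 frozen ones, which gives a rough intuition for where memory and time savings of the kind reported above can come from. The actual savings depend on the model, the optimizer state, and the training setup.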

Paper Structure (50 sections, 3 equations, 4 figures, 8 tables)

Introduction
Parameter Efficient Reinforcement Learning from Human Feedback
Reward Model Training
Reinforcement Learning of Policy
Datasets and Tasks
Text Summarization:
Harmless Response Generation:
Helpful Response Generation:
UI Automation:
Visual Question Answering:
Experimental Setup and Metrics
Reward Modeling
Reinforcement Learning
Evaluations
Text Summarization:
...and 35 more sections

Figures (4)

Figure 1: Standard RM training (left) vs. PE-RLHF RM training (right). PE-RLHF RM only trains the LoRA adapters, while keeping the Language Model backbone frozen.
Figure 2: Standard RLHF (left) vs. PE-RLHF (right). PE-RLHF only trains the LoRA adapters while keeping the Language Model backbone frozen.
Figure 3: PE-RLHF performs on par with standard RLHF. Both PE-RLHF and RLHF outperform SFT policies significantly on all tasks.
Figure 4: A trajectory of the send_email task from the UINav dataset. To complete this task, an agent should perform the following four steps: (a) click the compose button; (b) type the email address; (c) type the subject; (d) type the email content. The final action of clicking the send button is not shown due to space limitations.
