Aligning LLMs through Multi-perspective User Preference Ranking-based Feedback for Programming Question Answering

Hongyu Yang; Liyang He; Min Hou; Shuanghong Shen; Rui Li; Jiahui Hou; Jianhui Ma; Junda Zhao

TL;DR

This work tackles Code Community Question Answering (CCQA), where diverse user preferences and rapidly evolving APIs challenge the traditional alignment of LLMs. It introduces ALMupQA, a three-stage framework combining foundational Supervised Fine-Tuning (SFT), Multi-perspective Preference Ranking Alignment (MPRA), and Retrieval-augmented In-context Learning (RIL), together with the StaCCQA dataset. MPRA uses three scores (questioner bias, user votes, and content semantic quality) to rank multiple candidate answers in a listwise manner, while RIL mitigates outdated answers by inserting similar Q&A pairs into the prompt as few-shot examples. Experiments show substantial improvements over baselines in BLEU, BERTScore, CodeBERTScore, and GPT-4-based preference evaluation, supporting the claim that multi-perspective ranking and retrieval augmentation yield user-centric CCQA answers with practical value for software engineering and programming education.
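The three-score ranking that MPRA performs can be illustrated with a small sketch. It assumes a hypothetical `Answer` record, a binary bias score $s_q$ (1 if the questioner accepted the answer), a log-scaled vote score $s_u$ normalized by the most-voted answer in the pool, a content score $s_l$ supplied by some external semantic-quality scorer, and a weighted sum for the overall preference score $r$. These formulas and weights are illustrative assumptions, not the paper's exact definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    votes: int            # community votes (may be negative)
    accepted: bool        # accepted by the questioner?
    content_score: float  # s_l in [0, 1], assumed to come from an external scorer


def bias_score(ans: Answer) -> float:
    """s_q: hypothetical binary questioner-bias score."""
    return 1.0 if ans.accepted else 0.0


def vote_score(ans: Answer, max_votes: int) -> float:
    """s_u: log-scaled votes, normalized so the most-voted answer gets 1.0."""
    if max_votes <= 0:
        return 0.0
    return math.log1p(max(ans.votes, 0)) / math.log1p(max_votes)


def preference_score(ans: Answer, max_votes: int,
                     weights=(1.0, 1.0, 1.0)) -> float:
    """r: weighted sum of s_q, s_u and s_l (weights are illustrative)."""
    w_q, w_u, w_l = weights
    return (w_q * bias_score(ans)
            + w_u * vote_score(ans, max_votes)
            + w_l * ans.content_score)


def rank_answers(pool, weights=(1.0, 1.0, 1.0)):
    """Return the answer pool sorted by descending preference score r."""
    max_votes = max((a.votes for a in pool), default=0)
    return sorted(pool, key=lambda a: preference_score(a, max_votes, weights),
                  reverse=True)


pool = [
    Answer("a1", votes=120, accepted=False, content_score=0.9),
    Answer("a2", votes=15, accepted=True, content_score=0.7),
    Answer("a3", votes=3, accepted=False, content_score=0.4),
]
print([a.text for a in rank_answers(pool)])                          # ['a2', 'a1', 'a3']
print([a.text for a in rank_answers(pool, weights=(0.5, 1.0, 1.0))])  # ['a1', 'a2', 'a3']
```

With equal weights the accepted answer `a2` outranks the most-voted `a1`; lowering the bias weight to 0.5 flips them, which shows why the relative weighting of the three perspectives matters when building the ranking set.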

Abstract

Code Community Question Answering (CCQA) seeks to tackle programming-related issues, thereby boosting productivity in both software engineering and academic research. Recent advancements in Reinforcement Learning from Human Feedback (RLHF) have transformed the fine-tuning process of Large Language Models (LLMs) to produce responses that closely mimic human behavior. Leveraging LLMs with RLHF for practical CCQA applications has thus emerged as a promising area of study. Unlike standard code question-answering tasks, CCQA involves multiple possible answers, with varying user preferences for each response. Additionally, code communities often show a preference for new APIs. These challenges prevent LLMs from generating responses that cater to the diverse preferences of users in CCQA tasks. To address these issues, we propose a novel framework called Aligning LLMs through Multi-perspective User Preference Ranking-based Feedback for Programming Question Answering (ALMupQA) to create user-focused responses. Our approach starts with Multi-perspective Preference Ranking Alignment (MPRA), which synthesizes varied user preferences based on the characteristics of answers from code communities. We then introduce a Retrieval-augmented In-context Learning (RIL) module to mitigate the problem of outdated answers by retrieving responses to similar questions from a question bank. Due to the limited availability of high-quality, multi-answer CCQA datasets, we also developed a dataset named StaCCQA from real code communities. Extensive experiments demonstrated the effectiveness of the ALMupQA framework in terms of accuracy and user preference. Compared to the base model, ALMupQA showed nearly an 11% improvement in BLEU, with increases of 20% and 17.5% in BERTScore and CodeBERTScore, respectively.
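RIL's retrieval step can be sketched as nearest-neighbour search over a question bank followed by few-shot prompt assembly. The sketch below uses bag-of-words cosine similarity as a stand-in for whatever text encoder the paper actually uses; the prompt template (`Question:`/`Answer:`) and the helper names `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Bag-of-words term counts (stand-in for a learned text encoder)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query: str, bank, k: int = 2):
    """Return the k (question, answer) pairs whose questions are most similar to query."""
    q = embed(query)
    return sorted(bank, key=lambda qa: cosine(q, embed(qa[0])), reverse=True)[:k]


def build_prompt(query: str, bank, k: int = 2) -> str:
    """Prepend the retrieved pairs as few-shot examples before the new question."""
    shots = [f"Question: {q}\nAnswer: {a}\n" for q, a in retrieve(query, bank, k)]
    return "\n".join(shots + [f"Question: {query}\nAnswer:"])


bank = [
    ("How do I read a CSV file with pandas?", "Use pandas.read_csv(path)."),
    ("How to merge two dicts in Python?", "Use {**a, **b}, or a | b on Python 3.9+."),
    ("How do I write a DataFrame to CSV?", "Use df.to_csv(path, index=False)."),
]
print(build_prompt("read a csv file into pandas", bank, k=2))
```

Here the pandas CSV question is retrieved first and the unrelated dict-merging question is left out of the prompt. Because the bank can be refreshed with recent community answers, retrieval of this kind can surface newer APIs without retraining the model.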

Paper Structure (25 sections, 13 equations, 6 figures, 4 tables)

This paper contains 25 sections, 13 equations, 6 figures, 4 tables.

Introduction
Related works
  Code Community Question Answering
  Preference Alignment for Question Answering
Preliminaries
  Task Formulation
  Reinforcement Learning from Human Feedback
Methodology
  Foundational Supervised Fine-Tuning
  Multi-perspective Preference Ranking Alignment
    Multi-perspective Ranking Set Construction
    Preference Ranking Alignment
  Retrieval-augmented In-context Learning
Experiments
  StaCCQA Dataset Construction
...and 10 more sections

Figures (6)

Figure 1: An example of Code Community Question Answering. It contains a question $Q$ and a pool of answers $\{a_1,\cdots,a_9\}$; each $a_i$ comprises its text content, its number of votes, and a label indicating whether the questioner accepted it. In the semantic vector space, the LLM-generated answer $a_l$, the questioner-accepted answer $a_2$, and the user-preferred answer $a_1$ lie at some distance from one another.
Figure 2: Overall architecture of the ALMupQA framework, including three stages: (step1) foundational Supervised Fine-Tuning (SFT) for acquiring programming-specific knowledge, (step2) Multi-perspective Preference Ranking Alignment (MPRA) for integrating diverse preferences, and (step3) Retrieval-augmented In-context Learning (RIL) to address the issue of outdated answers by retrieving the most similar post-solution pairs as prompts.
Figure 3: Comparison of previous human-alignment methods with our approach in CCQA. (a) SFT aligns only with the questioner-accepted answer $a_2$, while (b) RLHF compares $a_2$ with the highest-voted user-preferred answer $a_1$, samples pairs of candidates $p_i \succ p_j$ from the full ranking to train a reward model, and then uses this reward model to fine-tune the base LLM. (c) Our method contrasts $p_i$ with all members of the preference set $\{ p_1,\cdots,p_{N_i} \}$ according to the overall preference score $r$, which combines the bias score $s_q$, the vote score $s_u$, and the content score $s_l$.
Figure 4: Statistics of the number of votes per question, and the mapping relationship among the bias score $s_q$, the vote score $s_u$, and the content score $s_l$.
Figure 5: The consistency correlations between accuracy-based metrics (BLEU, ROUGE, CHRF, BERTScore, CB-PR, and CB-F) and preference-based metrics (GPT-4 evaluation scores). A positive correlation indicates that accuracy metrics improve as preference scores increase.
...and 1 more figure
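The contrast between Figure 3(b) and 3(c) can be made concrete with two losses over model scores $s$ for candidate answers (for instance, a model's length-normalized log-likelihood of each answer; that choice is an assumption here). The pairwise loss is the standard Bradley-Terry objective used to train RLHF reward models on pairs $p_i \succ p_j$; the listwise loss is a Plackett-Luce negative log-likelihood that contrasts each answer with every answer ranked below it. The Plackett-Luce form is a stand-in for illustration, not necessarily the exact objective used in ALMupQA.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_loss(s_better: float, s_worse: float) -> float:
    """Bradley-Terry loss -log sigmoid(s_better - s_worse) for one pair p_i > p_j."""
    return softplus(s_worse - s_better)


def logsumexp(xs: Sequence[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def listwise_loss(scores: Sequence[float]) -> float:
    """Plackett-Luce NLL; scores are listed from most to least preferred (by r).

    Each answer i is contrasted with itself and every lower-ranked answer;
    the last position would contribute log(1) = 0, so it is skipped.
    """
    return sum(logsumexp(scores[i:]) - scores[i] for i in range(len(scores) - 1))


print(listwise_loss([3.0, 2.0, 1.0]))  # scores agree with the ranking: small loss
print(listwise_loss([1.0, 2.0, 3.0]))  # scores reverse the ranking: large loss
```

For two answers the listwise loss reduces exactly to the pairwise one; with more answers, a single listwise term uses the whole ranking at once instead of sampling isolated pairs, which is the difference Figure 3 highlights.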