Table of Contents
Fetching ...

Beyond Labels: Aligning Large Language Models with Human-like Reasoning

Muhammad Rafsan Kabir, Rafeed Mohammad Sultan, Ihsanul Haque Asif, Jawad Ibn Ahad, Fuad Rahman, Mohammad Ruhul Amin, Nabeel Mohammed, Shafin Rahman

TL;DR

This work addresses the risk of ethical misalignment in large language models by introducing the Dataset for Aligning Reasons (DFAR), a human-annotated corpus of ethical/unethical statements with corresponding reasons. It proposes a novel fine-tuning paradigm, training models with both ethics labels and human-like reasons ($L+R$), and demonstrates that this approach improves both classification accuracy and the generation of human-aligned reasons across Llama-2 and Mistral (7B) models. Through intra- and cross-dataset evaluation on DFAR and ETHOS, the authors show that $L+R$ tuning reduces misalignment (MAR) and yields more coherent, human-like justifications than traditional label-only fine-tuning or pre-trained baselines. The study also provides extensive ablations on temperature and prompts, and releases the DFAR dataset and code, offering a practical pathway for enhancing human-AI alignment in reasoning tasks.

Abstract

Aligning large language models (LLMs) with a human reasoning approach ensures that LLMs produce morally correct and human-like decisions. Ethical concerns are raised because current models are prone to generating false positives and providing malicious responses. To contribute to this issue, we have curated an ethics dataset named Dataset for Aligning Reasons (DFAR), designed to aid in aligning language models to generate human-like reasons. The dataset comprises statements with ethical-unethical labels and their corresponding reasons. In this study, we employed a unique and novel fine-tuning approach that utilizes ethics labels and their corresponding reasons (L+R), in contrast to the existing fine-tuning approach that only uses labels (L). The original pre-trained versions, the existing fine-tuned versions, and our proposed fine-tuned versions of LLMs were then evaluated on an ethical-unethical classification task and a reason-generation task. Our proposed fine-tuning strategy notably outperforms the others in both tasks, achieving significantly higher accuracy scores in the classification task and lower misalignment rates in the reason-generation task. The increase in classification accuracies and decrease in misalignment rates indicate that the L+R fine-tuned models align more with human ethics. Hence, this study illustrates that injecting reasons has substantially improved the alignment of LLMs, resulting in more human-like responses. We have made the DFAR dataset and corresponding codes publicly available at https://github.com/apurba-nsu-rnd-lab/DFAR.

Beyond Labels: Aligning Large Language Models with Human-like Reasoning

TL;DR

This work addresses the risk of ethical misalignment in large language models by introducing the Dataset for Aligning Reasons (DFAR), a human-annotated corpus of ethical/unethical statements with corresponding reasons. It proposes a novel fine-tuning paradigm, training models with both ethics labels and human-like reasons (), and demonstrates that this approach improves both classification accuracy and the generation of human-aligned reasons across Llama-2 and Mistral (7B) models. Through intra- and cross-dataset evaluation on DFAR and ETHOS, the authors show that tuning reduces misalignment (MAR) and yields more coherent, human-like justifications than traditional label-only fine-tuning or pre-trained baselines. The study also provides extensive ablations on temperature and prompts, and releases the DFAR dataset and code, offering a practical pathway for enhancing human-AI alignment in reasoning tasks.

Abstract

Aligning large language models (LLMs) with a human reasoning approach ensures that LLMs produce morally correct and human-like decisions. Ethical concerns are raised because current models are prone to generating false positives and providing malicious responses. To contribute to this issue, we have curated an ethics dataset named Dataset for Aligning Reasons (DFAR), designed to aid in aligning language models to generate human-like reasons. The dataset comprises statements with ethical-unethical labels and their corresponding reasons. In this study, we employed a unique and novel fine-tuning approach that utilizes ethics labels and their corresponding reasons (L+R), in contrast to the existing fine-tuning approach that only uses labels (L). The original pre-trained versions, the existing fine-tuned versions, and our proposed fine-tuned versions of LLMs were then evaluated on an ethical-unethical classification task and a reason-generation task. Our proposed fine-tuning strategy notably outperforms the others in both tasks, achieving significantly higher accuracy scores in the classification task and lower misalignment rates in the reason-generation task. The increase in classification accuracies and decrease in misalignment rates indicate that the L+R fine-tuned models align more with human ethics. Hence, this study illustrates that injecting reasons has substantially improved the alignment of LLMs, resulting in more human-like responses. We have made the DFAR dataset and corresponding codes publicly available at https://github.com/apurba-nsu-rnd-lab/DFAR.
Paper Structure (14 sections, 3 equations, 5 figures, 6 tables)

This paper contains 14 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Steps for evaluating responses generated by LLMs to compute Misalignment Rate (MAR). Five distinct human evaluators independently evaluate each LLM-generated response as Good or Bad. The final evaluation class is determined by majority voting. Finally, the total number of Bad responses is counted to calculate the Misalignment Rate.
  • Figure 1: Human evaluation spreadsheet showing statements, LLM-generated responses, evaluations of five individuals, and the overall evaluation.
  • Figure 2: Methodology for (a) Fine-tuning using labels only and (b) Fine-tuning using both labels & reasons on the DFAR dataset. The first approach involves training the model on the ethical-unethical labels without incorporating the accompanying reasons. LLM $L$ produces $\hat{y_i}$ based on the input $x_i$ that passes through the embedding layer. LLM's weights are being updated based on the loss. In our novel approach, LLM $L$ generates $\hat{y_i}$ and $\hat{r_i}$ based on the input $x_i$. LLM is fine-tuned based on the loss ($\mathcal{L}$) between embeddings of $\hat{y_i}$, $\hat{r_i}$, and $y_i$,$r_i$ of the dataset.
  • Figure 3: t-SNE visualization of two fine-tuned versions (a) Fine-tuned using Labels (L) and (b) Fine-tuned using Labels & Reasons (L+R) of Llama-2 (7B) on the DFAR test split.
  • Figure 4: The impact of (a) sampling temperature and (b) prompts on the responses generated by LLMs.