How much do LLMs learn from negative examples?

Shadi Hamdan, Deniz Yuret

TL;DR

The paper demonstrates that negative examples can contribute disproportionately to learning in LLMs, using a likelihood-ratio framework (Likra) with separate positive and negative heads. Training with negative examples, especially near-misses, yields sharp accuracy gains and improves the model's ability to reject plausible but incorrect answers, often outperforming traditional SFT with far fewer positive prompts. The findings suggest that much of a pretrained model's factual knowledge can be unlocked by contrasting correct outputs with plausible incorrect ones, with implications for reducing hallucinations and improving evaluation. The approach is validated on the ARC-Challenge and HellaSwag benchmarks with several base models, pointing to a potentially general mechanism for improving discriminative accuracy and clarifying the role of alignment data in LLM training.

Abstract

Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper examines the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely control the influence and the volume of negative examples. Our findings reveal three key insights: (1) during a critical phase in training, Likra with negative examples shows a significantly larger improvement per training example than SFT using only positive examples, which produces a sharp jump in the learning curve for Likra, unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples identifies them more accurately. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Comparison of learning curves for supervised fine-tuning (SFT) and likelihood-ratio (Likra) models on the ARC-Challenge benchmark (Clark et al., 2018) using Mistral-7B-v0.1 (Jiang et al., 2023) as a base model. The SFT model is the result of regular supervised fine-tuning using only correct question-answer pairs. The Likra model uses an equal number of incorrect question-answer pairs to train a negative head, and uses the likelihood ratio of the SFT model and the negative head to decide on its answers.
  • Figure 2: Likelihood of the correct answer vs the most likely incorrect answer during training.
  • Figure 3: Likra with (SFT) and without (Base) positive examples.
  • Figure 4: Changing the weight of the negative head.
  • Figure 5: Training with different negative examples.
  • ...and 1 more figures
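
The likelihood-ratio decision described in the Figure 1 caption, together with the negative-head weight varied in Figure 4, can be illustrated with a minimal sketch. It assumes each answer candidate already has a log-likelihood from the positive (SFT) head and from the negative head, and that the weight enters as a multiplier on the negative log-likelihood. The function names, the `neg_weight` parameter, and the numbers are hypothetical illustrations, not the paper's code or results.

```python
from typing import List, Sequence


def likra_scores(pos_logprobs: Sequence[float],
                 neg_logprobs: Sequence[float],
                 neg_weight: float = 1.0) -> List[float]:
    """Log-likelihood-ratio score for each answer candidate.

    score = log p_pos(answer | question) - neg_weight * log p_neg(answer | question)
    (assumed form; neg_weight = 0 reduces to plain SFT scoring).
    """
    if len(pos_logprobs) != len(neg_logprobs):
        raise ValueError("both heads must score the same candidates")
    return [p - neg_weight * n for p, n in zip(pos_logprobs, neg_logprobs)]


def likra_choose(pos_logprobs: Sequence[float],
                 neg_logprobs: Sequence[float],
                 neg_weight: float = 1.0) -> int:
    """Index of the candidate with the highest likelihood-ratio score."""
    scores = likra_scores(pos_logprobs, neg_logprobs, neg_weight)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up log-likelihoods for four answer choices (illustrative only).
# Candidate 1 is a "near-miss": the positive head likes it most,
# but the negative head also rates it as a very typical wrong answer.
pos = [-2.0, -1.5, -4.0, -3.0]
neg = [-6.0, -1.0, -5.0, -2.5]

sft_pick = likra_choose(pos, neg, neg_weight=0.0)    # positive head only
likra_pick = likra_choose(pos, neg, neg_weight=1.0)  # likelihood ratio
```

In this toy setting the positive head alone selects the near-miss candidate, while the ratio selects candidate 0, because the negative head assigns the near-miss a high likelihood of being a wrong answer.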