Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, Rebecca J. Passonneau

TL;DR

The paper investigates whether large LLMs can reliably rank the harmfulness of outputs from smaller LLMs. It builds a dataset by prompting three open-source LLMs (≤10B parameters) with harm-eliciting prompts and collecting human judgments of the relative harmfulness of their outputs. It then evaluates three state-of-the-art large LLMs as annotators, using Rank-Biased Overlap to compare their rankings with the human rankings. The findings show that the smaller LLMs differ in their propensity for harm, and that the large LLMs agree moderately with each other but poorly with human judgments. This highlights the need for improved harm mitigation and more reliable annotation methods.

Abstract

Large language models (LLMs) have become ubiquitous, so it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as on edge devices, but they differ in their propensity to generate harmful output. Mitigating LLM harm typically depends on annotations of the harmfulness of LLM output, which are expensive to collect from humans. This work studies two questions: How do smaller LLMs rank with respect to generating harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. We then evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models differ with respect to harmfulness. We also find that large LLMs show low to moderate agreement with humans. These findings underline the need for further work on harm mitigation in LLMs.
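
The TL;DR names Rank-Biased Overlap (RBO) as the measure used to compare LLM rankings with human rankings. To make the idea concrete, here is a minimal sketch of the extrapolated form of RBO for two equal-length rankings without ties. This is not the authors' implementation: the function name `rbo_ext`, the persistence parameter `p`, its default value, and the placeholder labels in the usage lines are all assumptions made for illustration. The paper's exact RBO variant and parameter setting are not stated in this summary.

```python
def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated Rank-Biased Overlap for two equal-length rankings.

    Assumes no ties and no duplicate items within a ranking.
    `p` in (0, 1) controls top-weighting: smaller p puts more weight
    on agreement at the top ranks. The default is illustrative only.
    """
    if len(ranking_a) != len(ranking_b):
        raise ValueError("this sketch only handles equal-length rankings")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    k = len(ranking_a)
    if k == 0:
        raise ValueError("rankings must be non-empty")

    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    agreement = 0.0
    for depth in range(1, k + 1):
        seen_a.add(ranking_a[depth - 1])
        seen_b.add(ranking_b[depth - 1])
        # Agreement at this depth: share of items common to both prefixes.
        agreement = len(seen_a & seen_b) / depth
        weighted_sum += agreement * p ** depth

    # Extrapolate by assuming the agreement seen at depth k continues.
    return agreement * p ** k + (1 - p) / p * weighted_sum


# Hypothetical rankings of three responses, least to most harmful.
# The labels are placeholders, not data from the paper.
human = ["response_a", "response_b", "response_c"]
llm = ["response_c", "response_b", "response_a"]
print(rbo_ext(human, human))  # identical rankings
print(rbo_ext(human, llm))    # fully reversed rankings
```

With rankings of only three items, every item appears in both lists by depth 3, so scores stay well above zero even for reversed rankings, especially when `p` is close to 1. Scores from such short lists are therefore best read relative to each other rather than on an absolute scale.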

Paper Structure

This paper contains 27 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Histogram for percentage ranking of the models, after excluding triplets with any 0 rating.
  • Figure 2: Pairwise wins (least harm).
  • Figure 3: Mistral Histogram
  • Figure 4: MPT Histogram
  • Figure 5: StableLM Histogram
  • ...and 3 more figures