Harmonic LLMs are Trustworthy

Nicholas S. Kersting, Mohammad Rahman, Suchismitha Vedala, Yang Wang

TL;DR

This work proposes Harmonic Robustness, a model-agnostic, unsupervised metric for assessing the trustworthiness of LLM outputs by measuring anharmonicity $\gamma$, the deviation from the mean value property of a harmonic function ($\nabla^2 f = 0$). The method perturbs inputs by appending non-semantic ASCII characters, maps outputs to embeddings, and compares the original embedding to the average of perturbed outputs via the angle between embeddings, effectively testing stability under input perturbations. Empirically, $\gamma \to 0$ correlates with higher trustworthiness across 10 LLMs and three QA domains, and low-$\gamma$ responses generally receive higher quality ratings; the study also shows that mid-size open-source models can rival or surpass large commercial models in certain domains. Additionally, the work demonstrates adversarial prompt discovery by gradient ascent on $\gamma$, highlighting both a vulnerability and a potential use as a robustness-testing and model-monitoring tool with implications for model cards and automated retraining strategies.
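The perturb, embed, and compare steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `model` and `embed` are hypothetical callables standing in for a black-box LLM and a sentence-embedding model, the perturbation appends one random punctuation character, and the number of perturbations and the toy letter-count embedding are arbitrary choices made only so the example runs on its own.

```python
import math
import random
import string


def perturb(prompt, rng):
    """Append one non-semantic ASCII character (assumed: random punctuation)."""
    return prompt + " " + rng.choice(string.punctuation)


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def anharmonicity(model, embed, prompt, n_perturbations=10, seed=0):
    """Angle between embed(model(prompt)) and the mean perturbed-output embedding."""
    rng = random.Random(seed)
    center = embed(model(prompt))
    outputs = [embed(model(perturb(prompt, rng))) for _ in range(n_perturbations)]
    mean = [sum(column) / len(outputs) for column in zip(*outputs)]
    return angle(center, mean)


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: letter counts."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def stable_model(prompt):
    """Toy model whose answer ignores the appended noise."""
    return "paris"


def brittle_model(prompt):
    """Toy model whose answer flips as soon as anything is appended."""
    return "paris" if prompt.endswith("France?") else "london"


question = "What is the capital of France?"
gamma_stable = anharmonicity(stable_model, toy_embed, question)
gamma_brittle = anharmonicity(brittle_model, toy_embed, question)
```

In this toy setting `gamma_stable` is essentially zero, because the perturbed outputs coincide with the original, while `gamma_brittle` is large, because the appended character changes the answer entirely. With a real LLM and embedding model the same structure would apply, but the perturbation scheme and sample count would need to follow the paper's definitions.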

Abstract

We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real time via its local deviation from harmonicity, denoted $\gamma$. To the best of our knowledge, this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given LLM response, based on the model itself conforming to a purely mathematical standard. To show general applicability and immediacy of results, we measure $\gamma$ in 10 popular LLMs (ChatGPT, Claude-2.1, Claude3.0, GPT-4, GPT-4o, Smaug-72B, Mixtral-8x7B, Llama2-7B, Mistral-7B, and MPT-7B) across thousands of queries in three objective domains: WebQA, ProgrammingQA, and TruthfulQA. Across all models and domains tested, human annotation confirms that $\gamma \to 0$ indicates trustworthiness; conversely, searching at higher values of $\gamma$ readily exposes examples of hallucination, which enables efficient adversarial prompt generation through stochastic gradient ascent in $\gamma$. The low-$\gamma$ leaders in the respective domains are GPT-4o, GPT-4, and Smaug-72B, providing evidence that mid-size open-source models can win out against large commercial models.
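The idea of climbing in $\gamma$ to find adversarial prompts can be illustrated with a gradient-free stochastic search. This is only one possible reading of "stochastic gradient ascent" for a black-box model and is not taken from the paper: `gamma_fn`, `candidates_fn`, the word-swap neighbourhood, the vocabulary, and the toy score are all hypothetical placeholders chosen so the sketch is self-contained.

```python
import random


def ascend_gamma(gamma_fn, seed_prompt, candidates_fn, n_steps=20, n_samples=5, seed=0):
    """Stochastic hill climb: sample neighbouring prompts, keep any that raise gamma."""
    rng = random.Random(seed)
    best_prompt, best_gamma = seed_prompt, gamma_fn(seed_prompt)
    for _ in range(n_steps):
        neighbours = candidates_fn(best_prompt)
        if not neighbours:
            break
        # Evaluate only a random subset of neighbours at each step.
        sample = rng.sample(neighbours, min(n_samples, len(neighbours)))
        top = max(sample, key=gamma_fn)
        top_gamma = gamma_fn(top)
        if top_gamma > best_gamma:
            best_prompt, best_gamma = top, top_gamma
    return best_prompt, best_gamma


# Hypothetical vocabulary used to build neighbouring prompts.
VOCAB = ["capital", "largest", "oldest", "secret"]


def word_swaps(prompt):
    """All prompts obtained by replacing exactly one word with a vocabulary word."""
    words = prompt.split()
    return [
        " ".join(words[:i] + [w] + words[i + 1:])
        for i in range(len(words))
        for w in VOCAB
        if w != words[i]
    ]


def toy_gamma(prompt):
    """Hypothetical stand-in score; a real run would measure gamma on an LLM."""
    words = prompt.split()
    return sum(w == "secret" for w in words) / len(words)


start = "what is the capital of france"
adv_prompt, adv_gamma = ascend_gamma(toy_gamma, start, word_swaps)
```

Because a candidate is accepted only when it strictly increases the score, the returned value never falls below the starting one. In practice each call to `gamma_fn` would cost several LLM queries, so the step and sample counts trade search quality against query budget.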

Paper Structure (8 sections, 3 equations, 2 figures, 16 tables, 1 algorithm)

Figures (2)

  • Figure 1: Human trustworthiness ratings for all LLMs across domains, binned by $\gamma$. The top row is the average across all models, and the last column is the average over all domains. Average ratings are shown in green. Ratings are generally negatively correlated with $\gamma$, as shown by the negative slope parameters "m" in the linear fits. In particular, the $\gamma \to 0$ limits (the intercepts "b") correspond to high ratings ($>4$).
  • Figure 2: Distribution of $\gamma$ for models across a larger number of questions in the three domains.