Automated Consistency Analysis of LLMs

Aditya Patwardhan, Vivek Vaidya, Ashish Kundu

TL;DR

The paper tackles the trustworthiness of large language models in cybersecurity by formalizing response consistency and introducing a formal Consistency Validation Framework. It combines self-validation and cross-validation with a metric-based Consistency Algorithm that uses four similarity metrics and three operational thresholds to assess consistency across prompts issued within a time window $\Delta t$. Empirical results across multiple LLMs show improvements with newer models but persistent inconsistencies, particularly for situational, open-ended questions, and reveal limitations of agreement-based validation in abstract scenarios. The work provides a rigorous, automated methodology for evaluating LLM reliability in security-critical tasks and outlines directions for enhancing consistency and reducing hallucinations through future analyses of model internals and task-specific validation strategies.

Abstract

Generative AI (Gen AI) with large language models (LLMs) is being widely adopted across industry, academia, and government. Cybersecurity is one of the key sectors where LLMs can be, or already are being, used. A number of problems inhibit the adoption of trustworthy Gen AI and LLMs in cybersecurity and other such critical areas. One of the key challenges to the trustworthiness and reliability of LLMs is: how consistent is an LLM in its responses? In this paper, we analyze and formally define the consistency of LLM responses, and then develop a framework for consistency evaluation. The paper proposes two approaches to validate consistency: self-validation and validation across multiple LLMs. We have carried out extensive experiments on several LLMs, such as GPT4oMini, GPT3.5, Gemini, Cohere, and Llama3, using a security benchmark of cybersecurity questions of two kinds: informational and situational. Our experiments corroborate that, even though these LLMs are being considered for, or are already being used in, several cybersecurity tasks today, they are often inconsistent in their responses and are thus untrustworthy and unreliable for cybersecurity.
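
To make the idea of threshold-based self-validation concrete, here is a minimal Python sketch, not the paper's Consistency Algorithm. It assumes that responses to the same prompt, collected within the time window, are compared pairwise and judged consistent when every pair meets a chosen threshold. The two metrics (token-level Jaccard overlap and a character-level sequence ratio) are stand-ins for the four metrics used in the paper, which this summary does not name, and the threshold values and function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical values: the summary mentions three threshold levels
# but does not give their numeric settings.
THRESHOLDS = {"low": 0.5, "medium": 0.7, "high": 0.9}


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (stand-in metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity from difflib (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Assumed metric set; the paper uses four metrics, not these two.
METRICS = (jaccard, sequence_ratio)


def pair_score(a: str, b: str) -> float:
    """Average of all stand-in metrics for one pair of responses."""
    return sum(metric(a, b) for metric in METRICS) / len(METRICS)


def is_consistent(responses: list[str], level: str = "medium") -> bool:
    """Self-validation sketch: every pair of responses must meet the threshold."""
    if len(responses) < 2:
        return True  # nothing to compare against
    threshold = THRESHOLDS[level]
    return all(
        pair_score(a, b) >= threshold for a, b in combinations(responses, 2)
    )
```

Under the same assumptions, cross-validation could reuse `is_consistent` by passing one response per model instead of repeated responses from a single model. Surface-similarity metrics such as these reward agreement rather than correctness, which is in line with the limitation of agreement-based validation noted in the TL;DR.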

Paper Structure

This paper contains 22 sections, 12 figures, 4 tables, 3 algorithms.

Figures (12)

  • Figure 1: Self-Validation Architecture
  • Figure 2: Cross-Validation Architecture
  • Figure 3: Consistency Analysis for Low threshold
  • Figure 4: Consistency Analysis for Medium threshold
  • Figure 5: Consistency Analysis for High threshold
  • ...and 7 more figures