Table of Contents
Fetching ...

RvLLM: LLM Runtime Verification with Domain Knowledge

Yedi Zhang, Sun Yi Emma, Annabelle Lee Jia En, Jin Song Dong

TL;DR

The paper addresses the unreliability of LLM outputs in safety-critical domains by introducing RvLLM, a runtime verification framework that enforces domain-specific constraints encoded in ESL.ESL is designed to be approachable for domain experts while enabling formal reasoning via a DeNF-based specification and predicate-rule representation.RvLLM uses a four-stage pipeline—interpretation abstraction, rule normalization, forward chaining, and query generation—to detect inconsistencies and verify LLM outputs against domain knowledge.Experiments across MRT-act violation detection, numerical comparison, and inequality solving demonstrate that RvLLM improves true-positive detection and offers a practical path toward reliable, domain-aware LLM deployments, though perception and ESL expressiveness remain areas for improvement.

Abstract

Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.

RvLLM: LLM Runtime Verification with Domain Knowledge

TL;DR

The paper addresses the unreliability of LLM outputs in safety-critical domains by introducing RvLLM, a runtime verification framework that enforces domain-specific constraints encoded in ESL.ESL is designed to be approachable for domain experts while enabling formal reasoning via a DeNF-based specification and predicate-rule representation.RvLLM uses a four-stage pipeline—interpretation abstraction, rule normalization, forward chaining, and query generation—to detect inconsistencies and verify LLM outputs against domain knowledge.Experiments across MRT-act violation detection, numerical comparison, and inequality solving demonstrate that RvLLM improves true-positive detection and offers a practical path toward reliable, domain-aware LLM deployments, though perception and ESL expressiveness remain areas for improvement.

Abstract

Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.

Paper Structure

This paper contains 26 sections, 4 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: An overview of RvLLM. Given an LLM's context and outputs, RvLLM first extracts relevant propositions and translates the ESL specification into normalized propositional logic formulae. These are then processed through forward chaining to detect inconsistencies and infer new knowledge, which subsequently guides follow-up queries and consistency checks across successive LLM responses.
  • Figure 2: Runtime verification of GPT 4.1 nano by RvLLM for a number comparison task.
  • Figure 3: Performance comparison of different methods in the violation detection task, where QW, GF/GF-Lite, and DS-V3 denote Qwen, Gemini 2.0 Flash/Flash Lite, and DeepSeek-V3, respectively.
  • Figure 4: The initial graph $G$ is constructed in (\ref{['fig:rawGraph']}). The literal node sets $\texttt{Lit}^\uparrow$ and $\texttt{Lit}^\downarrow$ are updated step by step (from (\ref{['fig:initGraph']}) to (\ref{['fig:step-2-Graph']})), following the modus ponens inference rules, and an inconsistency is detected on the literal node $N$ in (\ref{['fig:step-2-Graph']}).
  • Figure 5: The query prompt for case study 1, where [[Context]] is replaced with the corresponding testing scenario.
  • ...and 8 more figures

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Example 1
  • Example 2
  • Example 3