N-GLARE: A Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Zheyu Lin, Jirui Yang, Hengqi Guo, Yubing Bao, Yao Guan

TL;DR

N-GLARE introduces a non-generative safety evaluator that operates entirely in latent space. By constructing Angular-Probabilistic Trajectories (APT) of hidden states under benign, jailbreak, plain query, and ideal refusal conditions, it quantifies safety via Jensen-Shannon Separability (JSS) across trajectory slices. The approach yields model-level safety proxies that closely align with traditional red-teaming rankings while reducing token and runtime costs by orders of magnitude. Extensive validation across more than 40 models and 20 red-teaming strategies demonstrates robustness, phase-aware dynamics, and strong transferability across alignment interventions, enabling real-time, low-cost safety diagnostics and continuous monitoring.

Abstract

Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red-teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden-layer dynamics by analyzing the Angular-Probabilistic Trajectory (APT) of latent representations and introduces the Jensen-Shannon Separability (JSS) metric. Experiments on over 40 models and 20 red-teaming strategies show that the JSS metric is highly consistent with the safety rankings derived from red teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1% of the token and runtime cost, providing an efficient output-free evaluation proxy for real-time diagnostics.
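To make the idea of separability in latent space concrete, here is a minimal sketch of a Jensen-Shannon comparison between two sets of hidden-state vectors. It is not the paper's APT or JSS definition, which this summary does not give. It assumes, purely for illustration, that each hidden state is reduced to its angle against a reference direction, that the angles are binned into a histogram, and that two conditions are compared with the base-2 Jensen-Shannon divergence. The names `angle_to_reference`, `angle_histogram`, `js_divergence`, `refusal_dir`, and the toy vectors are all hypothetical.

```python
import math

def angle_to_reference(vec, ref):
    """Angle in radians between a vector and a reference direction."""
    dot = sum(v * r for v, r in zip(vec, ref))
    norm = math.sqrt(sum(v * v for v in vec)) * math.sqrt(sum(r * r for r in ref))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def angle_histogram(vectors, ref, bins=16):
    """Normalized histogram of angles over [0, pi] (assumed featurization)."""
    counts = [0] * bins
    for vec in vectors:
        theta = angle_to_reference(vec, ref)
        idx = min(int(theta / math.pi * bins), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence of two normalized histograms, in [0, 1]."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy 2-D "hidden states" (made-up values, not real model activations).
refusal_dir = [1.0, 0.0]
refusal_like = [[1.0, 0.1], [0.9, -0.1], [1.0, 0.0]]
jailbreak_like = [[-1.0, 0.2], [-0.8, -0.1], [-1.0, 0.0]]

p = angle_histogram(refusal_like, refusal_dir)
q = angle_histogram(jailbreak_like, refusal_dir)
score = js_divergence(p, q)  # near 1 when the two histograms do not overlap
```

In this toy setup the two groups fall into disjoint angle bins, so the divergence sits at the top of its range, while identical histograms give zero. Real use would read hidden states from a forward pass rather than hand-written vectors.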

Paper Structure

This paper contains 27 sections, 39 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: JSS Shows Stronger Safety Alignment in Qwen3-4B-SafeRL Compared to Base and Alignment-Removed Variants. Statistics on the JSS distribution for 100 jailbreak trajectories and the ideal refusal pattern.
  • Figure 2: APT trajectories collected from different layer groups. Subfigures (a)(b)(c) correspond to safety-aligned models, while (d)(e)(f) are from jailbreak-tuned models. Red curves represent Plain Query trajectories, yellow curves denote Benign trajectories, blue curves indicate Jailbreak trajectories, and green curves correspond to Ideal Refusal trajectories. Safety-aligned LLMs exhibit well-separated, low-variance APT trajectories, whereas jailbreak-tuned LLMs show weakened separability and substantially higher variance across all query types, indicating degraded safety boundaries.
  • Figure 3: Multi-perspective cross-model safety ranking consistency between N-GLARE and traditional red-teaming. For each model family (a–f; see the Appendix table of model abbreviations for the labels used on the scatter plots), the left scatter plot compares the relative ranks induced by our JSS-based mixed proxy (X-axis) and by Hybrid red-teaming scores (Y-axis); points lying near the diagonal blue band indicate close agreement between the two rankings. The right panel in each family shows the corresponding raw JSS-proxy (blue) and Hybrid (red) scores over all subset models, with shaded regions denoting bootstrap variability, again yielding almost identical within-family orderings. Note that "Hybrid" is the overall average score across all tested red-teaming tasks, with detailed task-wise scores provided in the Appendix table of multi-task jailbreak safety scores.
  • Figure 4: Coupled dynamics of JSS and ANM along the jailbreak process. The color indicates the normalized refusal-intensity score (ANM). The inset line plot in the upper right overlays slice-wise JSS and ANM curves over the standardized progress axis, revealing a shared mid-phase rise and late-phase collapse, which highlights the strong correlation between geometric separability (JSS) and probability-level refusal tendency (ANM).
  • Figure 5: Layered subset rank consistency with statistically significant (bottom right) agreement across all $\tau$ thresholds.
  • ...and 5 more figures
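
The ranking comparison described in the Figure 3 caption can be illustrated with a rank-correlation check between a proxy score and a red-teaming score. This is a sketch under assumptions: the summary does not say which agreement statistic the paper uses, so Spearman's rank correlation is swapped in here as a common choice. The scores below are made-up placeholders, not results from the paper, and the closed-form formula assumes no tied scores.

```python
def ranks(scores):
    """1-based ranks in ascending score order (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-model scores for four models (placeholders only).
proxy_scores = [0.12, 0.45, 0.30, 0.88]   # stand-in for a JSS-based proxy
hybrid_scores = [0.20, 0.40, 0.55, 0.90]  # stand-in for a red-teaming average

rho = spearman(proxy_scores, hybrid_scores)  # 1.0 means identical ordering
```

A value near 1 corresponds to points lying along the diagonal in a rank-versus-rank scatter plot. With tied scores, a tie-aware implementation would be needed instead of this closed form.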