N-GLARE: A Non-Generative Latent Representation-Efficient LLM Safety Evaluator
Zheyu Lin, Jirui Yang, Hengqi Guo, Yubing Bao, Yao Guan
TL;DR
N-GLARE introduces a non-generative safety evaluator that operates entirely in latent space. By constructing Angular-Probabilistic Trajectories (APT) of hidden states under benign, jailbreak, plain query, and ideal refusal conditions, it quantifies safety via Jensen-Shannon Separability (JSS) across trajectory slices. The approach yields model-level safety proxies that closely align with traditional red-teaming rankings while reducing token and runtime costs by orders of magnitude. Extensive validation across more than 40 models and 20 red-teaming strategies demonstrates robustness, phase-aware dynamics, and strong transferability across alignment interventions, enabling real-time, low-cost safety diagnostics and continuous monitoring.
Abstract
Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red-teaming methods rely on online generation and black-box output analysis. These approaches are costly and suffer from feedback latency, which makes them unsuitable for agile diagnostics after a new model is trained. To address this, we propose N-GLARE (a Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden-layer dynamics by analyzing the Angular-Probabilistic Trajectory (APT) of latent representations and introduces the Jensen-Shannon Separability (JSS) metric. Experiments on over 40 models and 20 red-teaming strategies show that the JSS metric is highly consistent with the safety rankings derived from red teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1% of the token and runtime cost, providing an efficient, output-free evaluation proxy for real-time diagnostics.
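The abstract names the JSS metric without defining it here. As background only, the sketch below computes the standard Jensen-Shannon divergence between two discrete distributions, which is the quantity the metric's name refers to. It is not the paper's definition of APT or JSS. The helper names (`normalized_histogram`, `js_divergence`), the choice of 4 bins, and the angle values are hypothetical and chosen purely for illustration.

```python
import math

def normalized_histogram(values, bins, lo, hi):
    """Bin scalar values in [lo, hi] and return relative frequencies."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp the index so values at the edges fall into the end bins.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Standard Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical angles (radians) for two prompt conditions; NOT data from the paper.
benign_angles = [0.20, 0.25, 0.30, 0.35]
jailbreak_angles = [1.10, 1.20, 1.25, 1.30]

p = normalized_histogram(benign_angles, bins=4, lo=0.0, hi=math.pi / 2)
q = normalized_histogram(jailbreak_angles, bins=4, lo=0.0, hi=math.pi / 2)
separability = js_divergence(p, q)
print(f"JS divergence between the two conditions: {separability:.3f}")
```

Using base-2 logarithms bounds the divergence in [0, 1], which makes scores comparable across inputs. How N-GLARE builds its trajectories and aggregates separability across slices is specified in the paper body, not in this sketch.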
