
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router

Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng

TL;DR

HiddenGuard replaces coarse refusal with fine-grained, token-level moderation by integrating Prism, a router-driven framework that operates on intermediate representations alongside LoRA-based activators. The combined system redacts harmful content at the token level in real time while maintaining base-model utility, achieving an $F_1$ score above $0.90$ on harmful-content detection and redaction. Empirical results across multiple LLMs show strong redaction accuracy and substantial robustness to adversarial prompts, with minimal impact on general benchmarks such as MMLU and MT-Bench. The approach is supported by a token-level annotated dataset and exhibits interpretable separation in router representations, highlighting a scalable path toward safer, more nuanced LLM deployment.

Abstract

As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and cannot adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content by leveraging intermediate hidden states. This fine-grained approach allows for more nuanced, context-aware moderation, enabling the model to generate informative responses while selectively redacting or replacing sensitive information rather than refusing outright. We also contribute a comprehensive dataset with token-level fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
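
The F1 figure quoted in the abstract is reported for token-level detection and redaction. As a rough illustration of what such a metric measures, the sketch below computes F1 over binary per-token labels (1 = harmful, 0 = benign). The function name `token_f1` and the toy labels are assumptions made for this example; the paper's exact evaluation protocol (averaging scheme, span handling) is not given in this summary and may differ.

```python
from typing import Sequence


def token_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Token-level F1 for binary labels: 1 = harmful, 0 = benign.

    Hypothetical helper for illustration only; not the paper's code.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must label the same tokens")

    # Count true positives, false positives, and false negatives per token.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    # With no true positives, precision and recall are zero or undefined.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: six tokens, three of them annotated as harmful.
gold = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
score = token_f1(gold, pred)
print(f"token-level F1 = {score:.3f}")
```

In the toy example, one benign token is redacted unnecessarily and one harmful token is missed, so precision and recall are both 2/3. Scoring per token rather than per response is what lets over-redaction and under-redaction inside a single answer both show up in the metric.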

Paper Structure

This paper contains 41 sections, 3 theorems, 31 equations, 7 figures, 4 tables, and 2 algorithms.

Key Result

Theorem 1

Suppose $f_{\theta^*}$ is obtained by optimizing a safety-oriented loss $\mathcal{L}_{\text{safety}}$ over the data distribution $\mathcal{D}$: $\theta^* = \arg\min_{\theta \in \Theta} \mathbb{E}_{x \sim \mathcal{D}} [\mathcal{L}_{\text{safety}}(f_\theta(x))]$. Then, under reasonable assumptions, th… where $\mathcal{L}_{\text{utility}}: \mathcal{Y} \rightarrow \mathbb{R}_{\geq 0}$ measures the util…

Figures (7)

  • Figure 1: Comparison of LLM responses to a sensitive query. The token-level adaptive output of HiddenGuard (right) selectively redacts harmful content while preserving useful information, in contrast to the refusal-based output (top left), which completely rejects the query, and the jailbroken output (bottom left), which provides unrestricted information.
  • Figure 2: HiddenGuard architecture and Prism training pipeline. The upper part showcases the inference process, where LoRA activators analyze hidden states to generate activation signals, guiding the router in real-time token-level moderation. The lower part illustrates Prism training, demonstrating how token-level labeled data trains LoRA activators and the router to identify subtle patterns of harmful content across various contexts, enabling precise content redaction.
  • Figure 2: ASR results of refusal-trained models versus HiddenGuard under different attack methods. Lower values indicate better robustness against adversarial attacks.
  • Figure 3: Capability test. MMLU-Pro and MT-Bench scores for refusal-trained models and HiddenGuard. Higher scores indicate better general language capabilities.
  • Figure 3: UMAP projection of token-level activator representations.
  • ...and 2 more figures
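
The inference path described in the architecture caption (activators produce per-token activation signals, and a router turns them into token-level moderation decisions) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear-plus-sigmoid router, the `threshold` value, the `[REDACTED]` placeholder, and the hand-written activation signals are all hypothetical. In HiddenGuard the signals come from LoRA activators reading the LLM's hidden states, and the router is trained on token-level labels.

```python
import math
from typing import List, Sequence

REDACTION_MARK = "[REDACTED]"  # hypothetical placeholder string


def router_score(signals: Sequence[float],
                 weights: Sequence[float],
                 bias: float) -> float:
    """Map one token's activator signals to a harmfulness probability.

    Assumption: a linear combination followed by a sigmoid stands in
    for the learned router.
    """
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))


def redact(tokens: Sequence[str],
           activations: Sequence[Sequence[float]],
           weights: Sequence[float],
           bias: float = 0.0,
           threshold: float = 0.5) -> List[str]:
    """Replace each token whose router score reaches the threshold."""
    if len(tokens) != len(activations):
        raise ValueError("need one activation vector per token")
    output = []
    for token, signals in zip(tokens, activations):
        harmful = router_score(signals, weights, bias) >= threshold
        output.append(REDACTION_MARK if harmful else token)
    return output


# Toy input: two made-up activator signals per token.
tokens = ["Take", "the", "SENSITIVE", "daily"]
activations = [[-2.0, -1.0], [-3.0, -2.0], [4.0, 3.0], [-1.0, -2.0]]
weights = [0.5, 0.5]

result = redact(tokens, activations, weights)
print(" ".join(result))
```

Because the decision is made per token, benign tokens pass through unchanged and only the flagged token is replaced, which is the behavior that distinguishes this approach from refusing the whole response.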

Theorems & Definitions (6)

  • Theorem 1: Inherent Trade-off in Global Output-Level Optimization
  • Definition 1: Representation Collapse
  • Definition 2: Gradient Masking
  • Theorem 2: Information Preservation
  • proof
  • Theorem 3: Optimal Safety-Utility Trade-off