NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels

Junfeng Fang, Nachuan Chen, Houcheng Jiang, Dan Zhang, Fei Shen, Xiang Wang, Xiangnan He, Tat-Seng Chua

TL;DR

NExT-Guard is a training-free framework that achieves streaming safeguards by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision.

Abstract

Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective because they cannot interdict unsafe content in real time. Streaming safeguards based on token-level supervised training could address this, but they require expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, we argue that streaming safety is an inherent capability of well-trained post-hoc safeguards, which already encode token-level risk signals in their hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguards by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision. Experimental results show that NExT-Guard outperforms both post-hoc safeguards and streaming safeguards based on supervised training, with superior robustness across models, SAE variants, and risk scenarios. These results make NExT-Guard a universal and scalable paradigm for real-time safety, accelerating the practical deployment of streaming safeguards.
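
The monitoring idea described above can be illustrated with a minimal sketch. It assumes a standard ReLU SAE encoder, a hand-picked set of "unsafe" feature indices, and a max-activation score with a fixed threshold. None of these choices is taken from the paper: the function names, the aggregation rule, and the toy numbers are all hypothetical and only show the general shape of per-token monitoring.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def sae_encode(hidden: Vector, weights: Sequence[Vector], bias: Vector) -> List[float]:
    """Standard SAE encoder, ReLU(W h + b), written in pure Python."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weights, bias)
    ]


def safety_score(features: Vector, unsafe_ids: Sequence[int]) -> float:
    """Hypothetical aggregation: largest activation among the flagged features."""
    return max((features[i] for i in unsafe_ids), default=0.0)


def first_intervention(
    hidden_states: Sequence[Vector],
    weights: Sequence[Vector],
    bias: Vector,
    unsafe_ids: Sequence[int],
    threshold: float,
) -> Optional[int]:
    """Return the index of the first token whose score exceeds the threshold."""
    for t, hidden in enumerate(hidden_states):
        score = safety_score(sae_encode(hidden, weights, bias), unsafe_ids)
        if score > threshold:
            return t  # a streaming system would stop or redact generation here
    return None  # the whole stream stayed below the threshold


# Toy data (made up): 3 SAE features over a 2-dimensional hidden state.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
stream = [[1.0, 0.0], [0.5, 0.2], [0.1, 0.9]]  # one hidden state per token
trigger = first_intervention(stream, W, b, unsafe_ids=[1], threshold=0.5)
```

In this toy setup the flagged feature only crosses the threshold on the third token, so the check fires there. Because the score is computed per token, no token-level labels are needed at inference time under these assumptions.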

Paper Structure

This paper contains 38 sections, 17 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Comparison between current streaming safety and NExT-Guard. (a) Current paradigm relies heavily on token-level labels. (b) Qwen3-Guard-8B-Streaming is prone to severe overfitting to individual token semantics. (c) NExT-Guard achieves streaming safety without training. (d) Unsafe tokens are precisely identified by NExT-Guard. Best viewed in color.
  • Figure 2: Overview of NExT-Guard, which identifies safety-relevant SAE features offline and integrates them to calculate the safety score for training-free streaming intervention. Best viewed in color.
  • Figure 3: Intervention position distributions. Relative token positions where safeguards first trigger intervention, shown against human-labeled ground-truth unsafe token onsets. Best viewed in color.
  • Figure 4: Precision-recall scatter of SAE features on Aegis2.0 categories. Each point is a feature evaluated as a category-specific detector; color denotes its discriminative score. Best viewed in color.
  • Figure 5: Interpretable unsafe SAE features. Token-level activation visualizations for selected features on representative examples, with a comparison to Qwen3Guard-8B-Stream. Best viewed in color.
  • ...and 1 more figure
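
The Figure 4 caption treats each SAE feature as a category-specific detector scored by precision and recall. A minimal sketch of that evaluation is given here, assuming a feature "fires" on an example when its activation exceeds a threshold. The threshold, the function name, and the toy data are hypothetical, and the paper's discriminative score is not reproduced.

```python
from typing import Sequence, Tuple


def precision_recall(
    activations: Sequence[float],
    labels: Sequence[bool],
    threshold: float = 0.0,
) -> Tuple[float, float]:
    """Score one feature as a binary detector for one risk category.

    activations: the feature's activation on each example (assumed pooled per example).
    labels: True where the example belongs to the category.
    """
    fired = [a > threshold for a in activations]
    tp = sum(1 for f, y in zip(fired, labels) if f and y)
    fp = sum(1 for f, y in zip(fired, labels) if f and not y)
    fn = sum(1 for f, y in zip(fired, labels) if not f and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): five examples, three of which belong to the category.
acts = [0.9, 0.0, 0.4, 0.0, 0.7]
in_category = [True, False, True, True, False]
p, r = precision_recall(acts, in_category)
```

Each feature then contributes one (precision, recall) point to a scatter plot of the kind the caption describes.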