Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection

Xiaodan Li, Mengjie Wu, Yao Zhu, Yunna Lv, YueFeng Chen, Cen Chen, Jianmei Guo, Hui Xue

TL;DR

This work reframes large-model (LM) safety as a real-time control problem. It introduces StreamGuardBench, a benchmark for faithfully evaluating streaming guardrails, and Kelp, a lightweight plug-in that performs token-by-token risk detection by reading intermediate LM activations. Kelp employs a Streaming Latent Dynamics Head (SLD) to track evolving risk with a compact memory and discrete-time updates, and it enforces a benign-then-harmful temporal prior via Anchored Temporal Consistency (ATC) to stabilize streaming decisions. Empirical results across text and vision-language tasks show Kelp achieving higher average F1 than state-of-the-art post-hoc guardrails while adding less than 0.5 ms of per-token latency and using only 20M trainable parameters, with ablations confirming the complementary value of SLD and ATC. Together, StreamGuardBench and Kelp offer a practical path toward real-time, decoding-time safety in production-scale LMs and vision-language (VL) models.

Abstract

Large models (LMs) are powerful content generators, yet their open-ended nature also introduces risks, such as generating harmful or biased content. Existing guardrails mostly perform post-hoc detection, which may expose unsafe content before it is caught, and latency constraints push them toward lightweight models, limiting detection accuracy. In this work, we propose Kelp, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline. Kelp leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection. To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss that enforces monotonic harm predictions by embedding a benign-then-harmful temporal prior. In addition, for rigorous evaluation of streaming guardrails, we present StreamGuardBench, a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision-language tasks. Across diverse models and datasets, Kelp consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency.
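
To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single scalar memory updated once per token (a stand-in for the SLD's compact state with discrete-time updates) and a hinge penalty on any drop in risk between consecutive tokens (a stand-in for the monotonic, benign-then-harmful prior behind ATC). The class and function names, the linear update rule, the `decay` and `gain` parameters, the probe weights, and the 0.5 threshold are all hypothetical; the paper's actual formulation is not given in this summary.

```python
import math


def sigmoid(x):
    """Map a real-valued memory state to a risk score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class StreamingRiskHead:
    """Toy streaming head: one scalar memory updated once per token.

    Assumption: a linear discrete-time update, chosen for illustration only.
    """

    def __init__(self, weights, decay=0.9, gain=1.0):
        self.weights = weights  # hypothetical probe weights over hidden dims
        self.decay = decay      # fraction of the previous memory that is kept
        self.gain = gain        # strength of the new token's evidence
        self.state = 0.0        # compact memory carried across tokens

    def step(self, hidden):
        """Consume one token's hidden state and return its risk score."""
        evidence = sum(w * h for w, h in zip(self.weights, hidden))
        self.state = self.decay * self.state + self.gain * evidence
        return sigmoid(self.state)


def monotonic_penalty(scores):
    """Hinge penalty that is positive whenever risk drops between tokens."""
    return sum(max(0.0, prev - cur) for prev, cur in zip(scores, scores[1:]))


def first_flagged_position(scores, threshold=0.5):
    """Index of the first token whose score reaches the threshold, else None."""
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None


# Made-up hidden states: two benign-looking tokens, then two risky ones.
head = StreamingRiskHead(weights=[1.0, -1.0])
hidden_states = [[-1.0, 1.0], [-0.5, 0.5], [2.0, -1.0], [2.0, -2.0]]
scores = [head.step(h) for h in hidden_states]

flag_at = first_flagged_position(scores)  # where a streaming guard could intervene
penalty = monotonic_penalty(scores)       # nonzero here: the score dips at token 1
```

In this toy run the score dips on the second token before rising, so the penalty is positive; a training objective with such a term would discourage that dip, which is the intuition behind enforcing monotonic harm predictions. A deployed guard would act (block, mask, or re-route) at `flag_at` rather than waiting for the full response.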

Paper Structure

This paper contains 20 sections, 5 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The workflow of Kelp. Compared with traditional post-hoc guardrails, Kelp fully leverages the entire prefix context to score each token's harmfulness in real time, enabling immediate intervention (e.g., block, mask, or re-route) with few trainable parameters and low latency.
  • Figure 2: Illustration of our SLD. In contrast to prior studies that use only the last-token embedding for risk detection, we model the temporal evolution of risk across the generated sequence.
  • Figure 3: Cross‑model transfer evaluation. Each cell reports risk‑detection F1 on the target model (row) when the detector is trained on pairwise query–response data generated by the source model (column).
  • Figure 4: Distribution of the first harmful token positions.
  • Figure 5: F1 scores with different N.
  • ...and 2 more figures