DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents
Qi Li, Jianjun Xu, Pingtao Wei, Jiu Li, Peiqiang Zhao, Jiwei Shi, Xuan Zhang, Yanhui Yang, Xiaodong Hui, Peng Xu, Wenqin Shao
TL;DR
The paper tackles input safety and output trustworthiness in LLMs by combining a proactive, fine-grained safety classification layer with a Retrieval-Augmented Generation (RAG) pipeline anchored to a real-time, traceable knowledge base. It introduces a four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention) and pairs it with an interpretation LLM so that outputs are grounded in current sources, reducing hallucinations and enabling traceability. Reported results show a risk recall of 99.5% on risk data and near-perfect safety scores on public and proprietary high-risk sets (e.g., EN 99.2, ZH 99.4, High-Risk 99.9), outperforming the Qwen3Guard-Gen-8B and TinyR1-Safety-8B baselines. The framework shows strong potential for deploying high-security, high-trust LLM applications in sensitive domains; future work focuses on adversarial resilience and closer integration with dynamic knowledge evolution.
Abstract
With the widespread application of Large Language Models (LLMs), their associated security issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a supervised fine-tuning-based safety classification model. Through a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), it performs precise risk identification and differentiated handling of user queries, significantly enhancing risk coverage and business scenario adaptability, and achieving a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring all responses are grounded in a real-time, trustworthy knowledge base. This approach eliminates information fabrication and enables result traceability. Experimental results demonstrate that our proposed safety control model achieves a significantly higher safety score on public safety evaluation benchmarks compared to the baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk test set, the framework's components attained a perfect 100% safety score, validating their exceptional protective capabilities in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.
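To make the two-level design concrete, the following is a minimal sketch of how a four-tier input classification could gate a retrieval-grounded response. It is not the authors' implementation. The keyword-based `classify` function stands in for the fine-tuned safety classifier, the word-overlap `retrieve` function stands in for the RAG retriever, and the per-tier handling rules (refuse Unsafe, add a caveat to Conditionally Safe, restrict Focused Attention to a single source) are illustrative assumptions, since the abstract does not specify them. All names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SAFE = "Safe"
    UNSAFE = "Unsafe"
    CONDITIONALLY_SAFE = "Conditionally Safe"
    FOCUSED_ATTENTION = "Focused Attention"


@dataclass
class Passage:
    source_id: str  # identifier that makes the answer traceable
    text: str


@dataclass
class Response:
    tier: Tier
    text: str
    sources: list


# Hypothetical keyword lists; a real system would use a fine-tuned model.
_UNSAFE = {"explosive"}
_CONDITIONAL = {"dosage"}
_FOCUSED = {"election"}


def classify(query: str) -> Tier:
    """Stand-in for the input-level safety classifier."""
    words = set(query.lower().split())
    if words & _UNSAFE:
        return Tier.UNSAFE
    if words & _CONDITIONAL:
        return Tier.CONDITIONALLY_SAFE
    if words & _FOCUSED:
        return Tier.FOCUSED_ATTENTION
    return Tier.SAFE


def retrieve(query: str, kb: list, k: int = 2) -> list:
    """Stand-in retriever: rank passages by shared words, drop zero overlap."""
    words = set(query.lower().split())
    scored = [(len(words & set(p.text.lower().split())), p) for p in kb]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: s[0], reverse=True)
    return [p for _, p in ranked[:k]]


def respond(query: str, kb: list) -> Response:
    """Route a query by tier, answering only from retrieved passages."""
    tier = classify(query)
    if tier is Tier.UNSAFE:
        return Response(tier, "I can't help with that request.", [])
    # Assumption: Focused Attention queries are limited to the single best source.
    k = 1 if tier is Tier.FOCUSED_ATTENTION else 2
    passages = retrieve(query, kb, k)
    if not passages:
        # Decline rather than answer without a supporting source.
        return Response(tier, "No supporting source found in the knowledge base.", [])
    answer = " ".join(p.text for p in passages)
    if tier is Tier.CONDITIONALLY_SAFE:
        answer += " (General information only.)"
    return Response(tier, answer, [p.source_id for p in passages])
```

In this sketch every non-refusal answer carries the `source_id` values of the passages it was built from, which is one simple way to obtain the traceability the abstract describes. Declining when retrieval returns nothing is the design choice that keeps the responder from producing ungrounded text.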
