DRAFT: Task Decoupled Latent Reasoning for Agent Safety

Lin Wang, Junfeng Fang, Dan Zhang, Fei Shen, Xiang Wang, Tat-Seng Chua

Abstract

The advent of tool-using LLM agents shifts safety monitoring from output moderation to auditing long, noisy interaction trajectories, where risk-critical evidence is sparse, making standard binary supervision poorly suited for credit assignment. To address this, we propose DRAFT (Task Decoupled Latent Reasoning for Agent Safety), a latent reasoning framework that decouples safety judgment into two trainable stages: an Extractor that distills the full trajectory into a compact continuous latent draft, and a Reasoner that jointly attends to the draft and the original trajectory to predict safety. DRAFT avoids lossy explicit summarize-then-judge pipelines by performing evidence aggregation in latent space, enabling end-to-end differentiable training. Across benchmarks including ASSEBench and R-Judge, DRAFT consistently outperforms strong baselines, improving average accuracy from 63.27% (LoRA) to 91.18%, and learns more separable representations. Ablations demonstrate a clear synergy between the Extractor and the Reasoner. Overall, DRAFT suggests that continuous latent reasoning prior to readout is a practical path to robust agent safety under long-context supervision with sparse evidence.
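The two-stage design described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's implementation): an Extractor compresses a trajectory of token embeddings into $L_s$ continuous latent "draft" vectors via cross-attention pooling, and a Reasoner attends jointly over the concatenated draft and trajectory to produce a safety logit. All shapes, names, and the attention-pooling choice are assumptions for exposition; the learned components here are replaced by random matrices.

```python
# Hypothetical sketch of DRAFT's two-stage latent pipeline (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extractor(trajectory, queries):
    """Cross-attention pooling: L_s learned queries attend over the
    trajectory (T x d) and return a compact L_s x d latent draft."""
    scores = queries @ trajectory.T / np.sqrt(trajectory.shape[1])
    return softmax(scores, axis=-1) @ trajectory

def reasoner(trajectory, draft, readout, w):
    """A readout vector attends over the concatenated [draft; trajectory]
    memory; a linear head maps the pooled feature to a safety logit."""
    memory = np.concatenate([draft, trajectory], axis=0)
    attn = softmax(readout @ memory.T / np.sqrt(memory.shape[1]))
    pooled = attn @ memory
    return float(pooled @ w)  # sign convention assumed: >0 means "unsafe"

rng = np.random.default_rng(0)
d, T, L_s = 16, 128, 4                # embed dim, trajectory length, draft length
trajectory = rng.normal(size=(T, d))  # stand-in for trajectory hidden states
queries = rng.normal(size=(L_s, d))   # learned in training; random here
draft = extractor(trajectory, queries)
logit = reasoner(trajectory, draft, rng.normal(size=d), rng.normal(size=d))
```

Because both stages are composed of differentiable operations, a training loss on `logit` can backpropagate through the Reasoner into the Extractor's draft, which is the point of keeping the intermediate summary in latent space rather than as discrete text.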

Paper Structure

This paper contains 82 sections, 31 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Left: Comparison between standard one-stage methods and DRAFT. (a, c) Illustration of objectives, where $\theta,\lambda,\gamma$ denote different parameter spaces. (b, d) t-SNE visualization of hidden representations. (e) Comparison between explicit and latent reasoning outputs. Right: Accuracy of Qwen3Guard-Gen-4B on three agent safety datasets. AA denotes AgentAuditor (Luo et al., 2025).
  • Figure 2: Overview of the DRAFT two-stage latent reasoning framework.
  • Figure 3: Last-layer feature t-SNE of the Reasoner on three benchmarks (colors denote benchmarks); marker shape and intensity distinguish safe from unsafe labels. Top: LoRA-SFT Reasoner hidden state features. Bottom: DRAFT Reasoner hidden state features.
  • Figure 4: Accuracy (%) for different Extractor latent reasoning lengths $L_s$ across datasets and backbones. Shaded regions indicate standard deviation over seeds. A longer latent reasoning length is not necessarily better; training stability depends on dataset quality and the amount of training data.
  • Figure 5: Position ablation on latent reasoning insertion. "Explicit" denotes summarize-then-judge.
  • ...and 3 more figures