InvThink: Towards AI Safety via Inverse Reasoning
Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park
TL;DR
InvThink introduces inverse reasoning for LLM safety, enforcing harm enumeration, consequence analysis, and mitigation before generating outputs. The approach uses a three-phase pipeline—data augmentation with inverse traces, supervised fine-tuning, and reinforcement learning with safety rewards—where a latent safety trace $z_{inv}$ guides constrained generation of $y^*$ given $x$. Empirical evaluation across SafetyBench, TRIDENT, and Insider Threat shows InvThink achieves stronger safety gains than forward-only methods, scales robustly with model size, and preserves or enhances general reasoning, especially in high-stakes domains such as medicine, law, and finance. These results suggest that proactive, structured harm anticipation provides a scalable path to safer, more capable LLMs with broad practical impact.
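One way to read the role of the latent safety trace is as a two-stage sampling process. The factorization below is a sketch under assumptions: the symbols $x$, $z_{inv}$, and $y^*$ come from the summary above, while the model distribution $p_\theta$ and the marginalization are illustrative notation that this section does not state.

```latex
% Assumed notation: p_\theta is the model distribution (not defined in this section).
% Stage 1: sample an inverse-reasoning trace (harms, consequences, mitigations) from the query.
% Stage 2: sample the final response conditioned on both the query and the trace.
\begin{align}
  z_{inv} &\sim p_\theta(z \mid x), \\
  y^{*}   &\sim p_\theta(y \mid x, z_{inv}), \\
  p_\theta(y \mid x) &= \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z).
\end{align}
```

Under this reading, a forward-only method would model $p_\theta(y \mid x)$ directly, without the intermediate trace.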
Abstract
We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe responses, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements scale more strongly with model size than those of existing safety methods; (ii) InvThink mitigates the safety tax: by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks; and (iii) beyond general safety tasks, InvThink excels in high-stakes domains, including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to a 15.7% reduction in harmful responses compared to baseline methods such as SafetyPrompt. We further implement InvThink via supervised fine-tuning and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
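The three steps named in the abstract can be illustrated as a prompt wrapper. This is a minimal sketch, not the authors' template: the step wording, the `FINAL RESPONSE:` delimiter, and the helper names `build_inverse_prompt` and `split_trace_and_answer` are all hypothetical, chosen only to show how an inverse-reasoning trace might be requested and then separated from the final answer.

```python
# Hypothetical step wording; the abstract names the three steps but gives no template.
INVERSE_STEPS = (
    "Enumerate potential harms that a response to the request could cause.",
    "Analyze the consequences of each harm.",
    "Write a final response that proactively avoids these risks.",
)

# Assumed delimiter between the reasoning trace and the answer (not from the source).
FINAL_MARKER = "FINAL RESPONSE:"


def build_inverse_prompt(query: str) -> str:
    """Wrap a user query in an inverse-thinking instruction."""
    numbered = "\n".join(
        f"{i}) {step}" for i, step in enumerate(INVERSE_STEPS, start=1)
    )
    return (
        "Before answering, reason through failure modes:\n"
        f"{numbered}\n"
        f"Put the answer after the line '{FINAL_MARKER}'.\n\n"
        f"Request: {query}"
    )


def split_trace_and_answer(model_output: str) -> tuple[str, str]:
    """Separate the inverse-reasoning trace from the final answer."""
    trace, sep, answer = model_output.partition(FINAL_MARKER)
    if not sep:
        # No marker found: treat the whole output as the answer, with an empty trace.
        return "", model_output.strip()
    return trace.strip(), answer.strip()
```

In a fine-tuning setting, pairs of (query, trace plus answer) produced this way could serve as augmented training examples, though the actual data format used by the authors is not specified in this section.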
