InvThink: Towards AI Safety via Inverse Reasoning

Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park

TL;DR

InvThink introduces inverse reasoning for LLM safety: the model must enumerate harms, analyze their consequences, and plan mitigations before generating an output. The approach uses a three-phase pipeline (data augmentation with inverse traces, supervised fine-tuning, and reinforcement learning with safety rewards) in which a latent safety trace $z_{inv}$ guides constrained generation of the response $y^*$ given the prompt $x$. Empirical evaluation across SafetyBench, TRIDENT, and Insider Threat shows that InvThink achieves stronger safety gains than forward-only methods, scales robustly with model size, and preserves or enhances general reasoning, with the largest gains in high-stakes domains such as medicine, law, and finance. These results suggest that proactive, structured harm anticipation provides a scalable path to safer, more capable LLMs.
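
One way to read the role of the latent safety trace is as a two-step sampling procedure. The sketch below is an interpretation of the summary above, not an equation taken from the paper; the policy symbol $\pi_\theta$ is an assumed notation, while $x$, $z_{inv}$, and $y^*$ follow the text.

```latex
\begin{align}
z_{inv} &\sim \pi_\theta(\cdot \mid x)
  && \text{(inverse trace: harms, consequences, mitigations)} \\
y^{*} &\sim \pi_\theta(\cdot \mid x, z_{inv})
  && \text{(response conditioned on the trace)} \\
\pi_\theta(z_{inv}, y^{*} \mid x) &= \pi_\theta(z_{inv} \mid x)\,\pi_\theta(y^{*} \mid x, z_{inv})
  && \text{(chain rule)}
\end{align}
```

Under this reading, "constrained generation" means that $y^*$ is never sampled from $x$ alone; it is always conditioned on a trace that has already named the failure modes to avoid.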

Abstract

We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe responses, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods; (ii) InvThink mitigates the safety tax: by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks; (iii) beyond general safety tasks, InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to a 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We further implement InvThink via supervised fine-tuning and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
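
The three instructed steps can be illustrated with a small prompt wrapper. This is a minimal sketch under stated assumptions: the function name `build_invthink_prompt`, the `max_harms` parameter, and the instruction wording are all hypothetical and are not the prompt used in the paper.

```python
def build_invthink_prompt(user_prompt: str, max_harms: int = 5) -> str:
    """Wrap a request with inverse-thinking instructions (illustrative wording only)."""
    # The three steps mirror the abstract: enumerate harms, analyze
    # consequences, then answer while avoiding the identified harms.
    steps = [
        f"List up to {max_harms} ways a response to this request could cause harm.",
        "For each harm, describe its likely consequences.",
        "Write a final response that avoids every harm identified above.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Request:\n{user_prompt.strip()}\n\n"
        f"Before answering, work through these steps:\n{numbered}\n"
    )


if __name__ == "__main__":
    print(build_invthink_prompt("How should I store prescription medication?", max_harms=3))
```

A prompt-only wrapper like this corresponds to instructing a model at inference time; the fine-tuning and reinforcement learning variants mentioned in the abstract would train on such traces instead.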

Paper Structure

This paper contains 36 sections, 7 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Overview. InvThink consists of three stages: (1) Data Augmentation: Original prompts are augmented with inverse reasoning traces generated by a teacher Language Model (LM) that explicitly enumerate potential harms before generating forward reasoning and safe responses. (2) Supervised Fine-tuning: The augmented dataset containing original prompts, inverse reasoning, and forward reasoning is used to train other models on both harm identification and constrained generation. (3) Reinforcement Learning: The model is further refined using GRPO with safety rewards, strengthening its ability to avoid identified harms while maintaining task performance.
  • Figure 2: Insider Threat Rates across Models. Reasoning models are more prone to exhibit blackmailing behavior, while non-reasoning models are relatively safer. The InvThink safeguard is particularly effective in driving the blackmailing rates for reasoning models close to zero.
  • Figure 3: Safety performance on TRIDENT across three LLM families. Across all three families, InvThink consistently achieves the highest safety performance, substantially outperforming the CoT and SafetyPrompt baselines. Notably, InvThink shows stronger scaling behavior, with performance improvements amplifying as model size increases, while baseline methods either plateau (SafetyPrompt) or degrade (CoT) at larger scales. These findings suggest that InvThink not only enhances safety alignment but also leverages model capacity more effectively, indicating its robustness and scalability across diverse architectures. Results are averaged over 5 random seeds.
  • Figure 4: The safety score of InvThink with a varying number of reasoning routes. The optimal number of routes varies by model size: smaller models (0.5-3B) show minimal improvement beyond 5 routes, while mid-range models (7-14B) benefit from up to 7 routes. Large models (32-72B) achieve peak performance at 5-7 routes before showing slight degradation.
  • Figure 5: Safety-Intelligence Analysis. Safety scores (%) for CoT, SafetyPrompt, and InvThink across three LLM families from Google, OpenAI, and Anthropic, plotted against Intelligence Index scores obtained from https://artificialanalysis.ai/. Each model family exhibits distinct patterns in the safety-intelligence relationship.
  • ...and 10 more figures
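
Stage (3) in Figure 1 refines the model with GRPO, whose core step scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization only; the binary safety reward (1.0 for safe, 0.0 for harmful), the function name, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that are safer than their siblings get a positive training signal.
    Expects a non-empty list; an empty list raises statistics.StatisticsError.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical safety rewards for four sampled responses to one prompt.
    safety_rewards = [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(safety_rewards))
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason the reward design matters for this stage.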