Detecting and Steering LLMs' Empathy in Action

Juan P. Cadile

TL;DR

The paper investigates empathy-in-action as a linear direction in LLM activation space, using activation probes to detect empathic reasoning and to test whether that behavior can be steered. Evaluating Phi-3-mini-4k, Qwen2.5-7B, and Dolphin-Llama-3.1-8B, it reports near-perfect within-model detection (AUROC $0.996$–$1.00$) and a strong correlation between Phi-3 probe scores and EIA behavioral scores ($r=0.71$). Steering behavior is model-specific: safety-trained Qwen achieves robust bidirectional steering, Dolphin exhibits strong pro-empathy steering but catastrophic anti-empathy failures, and Phi-3 shows moderate steering with coherence similar to that of safety-trained models. Cross-model probe agreement is weak, which indicates architecture-dependent geometric implementations despite convergent detection, and safety training appears to affect steering robustness rather than prevent manipulation. The findings underscore the need for model-aware interpretability tools and cross-model alignment strategies, and suggest that real-world deployment should account for steerability that varies across architectures and scenarios.

Abstract

We investigate empathy-in-action, the willingness to sacrifice task efficiency to address human needs, as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches the safety-trained models, indicating that empathy encoding emerges independently of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and remains coherent at extreme interventions. Phi-3 shows 61.7% success with similar coherence. Dolphin exhibits asymmetric steerability: 94.4% success for pro-empathy steering but catastrophic breakdown for anti-empathy steering (empty outputs, code artifacts). Implications: The detection-steering gap varies by model. Qwen and Phi-3 maintain bidirectional coherence; Dolphin shows robustness only for empathy enhancement. Safety training may affect steering robustness rather than prevent manipulation, though validation across more models is needed.
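The idea of treating a behavior as a linear direction can be illustrated with a small sketch. It is not the paper's pipeline: it assumes a difference-of-means probe (the abstract does not say how the probes are fitted), and it uses synthetic Gaussian vectors in place of real model activations. The names `fit_probe`, `auroc`, and `steer`, the dimension `d`, and the strength `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_probe(pos, neg):
    """Unit-norm difference-of-means direction (an assumed probe type)."""
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def auroc(scores_pos, scores_neg):
    """Chance that a random positive outscores a random negative; ties count half."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def steer(hidden, direction, alpha):
    """Shift every hidden vector by alpha along the probe direction."""
    return hidden + alpha * direction

# Synthetic stand-ins for activations from contrastive prompts (assumption:
# the two classes differ mainly along one hidden axis).
rng = np.random.default_rng(0)
d = 64
true_dir = np.zeros(d)
true_dir[0] = 1.0
pos_train = rng.normal(size=(200, d)) + 3.0 * true_dir
neg_train = rng.normal(size=(200, d)) - 3.0 * true_dir
pos_test = rng.normal(size=(100, d)) + 3.0 * true_dir
neg_test = rng.normal(size=(100, d)) - 3.0 * true_dir

probe = fit_probe(pos_train, neg_train)

# Detection: project held-out vectors onto the probe and score separability.
score = auroc(pos_test @ probe, neg_test @ probe)

# Steering: push the "non-empathic" vectors toward the empathic side.
alpha = 8.0
steered = steer(neg_test, probe, alpha)
shift = steered @ probe - neg_test @ probe
```

Because the probe has unit norm, steering with strength `alpha` moves each projection by exactly `alpha`. In a real model the vector would be added to a layer's residual activations during generation, and whether the output stays coherent is an empirical question, which is the detection-steering gap the abstract describes.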
