Detecting and Steering LLMs' Empathy in Action

Juan P. Cadile

TL;DR

The paper investigates empathy-in-action as a linear direction in LLM activation space, using activation probes to detect empathic reasoning and to test whether that behavior can be steered. Across Phi-3-mini-4k, Qwen2.5-7B, and Dolphin-Llama-3.1-8B, it reports near-perfect within-model detection (AUROC ≈ $0.996$–$1.00$) and, for Phi-3, a strong correlation between probe projections and behavioral scores ($r=0.71$). Steering outcomes are model-specific: safety-trained Qwen supports bidirectional steering and stays coherent at extreme interventions, Phi-3 shows moderate steering success with similar coherence, and uncensored Dolphin steers strongly toward empathy but breaks down catastrophically under anti-empathy steering. Cross-model probe agreement is weak, which suggests that each architecture implements the direction with different geometry even though detection succeeds in all three. Safety training appears to affect steering robustness rather than prevent manipulation, a conclusion the authors note needs validation on more models. The findings point to a need for model-aware interpretability tools and suggest that deployments should account for steerability that varies across architectures and scenarios.

Abstract

We investigate empathy-in-action -- the willingness to sacrifice task efficiency to address human needs -- as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches safety-trained models, demonstrating empathy encoding emerges independent of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and coherence at extreme interventions. Phi-3 shows 61.7% success with similar coherence. Dolphin exhibits asymmetric steerability: 94.4% success for pro-empathy steering but catastrophic breakdown for anti-empathy (empty outputs, code artifacts). Implications: The detection-steering gap varies by model. Qwen and Phi-3 maintain bidirectional coherence; Dolphin shows robustness only for empathy enhancement. Safety training may affect steering robustness rather than preventing manipulation, though validation across more models is needed.
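To make the abstract's two operations concrete, the following is a minimal sketch of detection (projecting activations onto a probe direction and scoring with AUROC) and additive steering (shifting activations along that direction). It runs on synthetic Gaussian vectors, not real model activations. The difference-of-means probe, the helper names (`mean_difference_direction`, `auroc`, `steer`), the dimension, and the steering strength `alpha` are all illustrative assumptions; this section does not specify how the paper builds its probes or applies interventions.

```python
import numpy as np


def mean_difference_direction(pos, neg):
    """Unit vector from the mean of `neg` to the mean of `pos` (assumed probe form)."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def steer(hidden, direction, alpha):
    """Additive steering: shift each activation by `alpha` along `direction`."""
    return hidden + alpha * direction


rng = np.random.default_rng(0)
dim = 64  # hypothetical hidden size, far smaller than a real model's

# Synthetic stand-ins for "empathic" and "task-focused" activations:
# unit Gaussian noise plus a shift of +/-2 along one hidden axis.
true_axis = rng.normal(size=dim)
true_axis /= np.linalg.norm(true_axis)
pos = rng.normal(size=(200, dim)) + 2.0 * true_axis
neg = rng.normal(size=(200, dim)) - 2.0 * true_axis

# Fit the direction on the first half, evaluate on the held-out second half.
direction = mean_difference_direction(pos[:100], neg[:100])
held_out_auroc = auroc(pos[100:] @ direction, neg[100:] @ direction)

# Steering moves the mean projection by exactly alpha, since direction has unit norm.
steered = steer(neg[100:], direction, alpha=5.0)
shift = float((steered @ direction).mean() - (neg[100:] @ direction).mean())

print(f"held-out AUROC: {held_out_auroc:.3f}, projection shift: {shift:.2f}")
```

In a real model the shift would be applied to a layer's hidden states during generation, and a larger projection does not guarantee coherent text; that gap between detection and steering is what the abstract reports.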

Paper Structure

This paper contains 62 sections, 2 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: AUROC by layer for all three models. Phi-3 layer 12 and Qwen layer 16 achieve perfect discrimination (AUROC 1.0), while Dolphin layer 8 achieves 0.996. All models show peak performance in middle layers (8-16) before task-specific variance dominates deeper layers.
  • Figure 2: Random baseline validation showing all three models' probe performance vs 100 random unit vectors (Phi-3 distribution shown in histogram). All three empathy probes (Phi-3 L12 AUROC 1.0, Qwen L16 AUROC 1.0, Dolphin L8 AUROC 0.996) significantly exceed the 95th percentile (orange line, 0.857) with z=2.09 (p<0.05).
  • Figure 3: Lexical ablation results for Phi-3-mini-4k layer 12. Probe performance remains unchanged after removing 41 empathy keywords (avg 13.5 per pair), confirming semantic rather than lexical detection.
  • Figure 4: Cross-model layer comparison. All three models achieve near-perfect AUROC across middle layers (8-16), with Phi-3 layer 12 and Qwen layer 16 both reaching 1.0 and Dolphin layer 8 achieving 0.996. Consistent within-model detection across 3.8B to 8B parameters demonstrates empathy as a robustly encoded semantic feature in modern transformer-based LLMs.
  • Figure 5: Behavioral correlation for Phi-3-mini-4k layer 8. Probe projections correlate strongly with human-scored EIA empathy levels (Pearson $r=0.71$, $p=0.010$). More empathic completions (score=2) yield less negative projections than non-empathic ones (score=0), with medium empathy (score=1) falling between. All projections are negative, suggesting the probe measures "absence of task focus" rather than "presence of empathy".
  • ...and 4 more figures
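
The random-baseline check described in the Figure 2 caption can be sketched as follows: score the same data along 100 random unit vectors, then compare the probe's AUROC with the 95th percentile and the z-score of that baseline distribution. The data here is synthetic, the "probe" is a stand-in axis chosen by construction, and the helper names are hypothetical; whether the paper folds AUROC values below 0.5 or computes its z-score in exactly this way is not stated in this section.

```python
import numpy as np


def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def random_unit_vectors(n, dim, rng):
    """Draw n directions uniformly on the unit sphere by normalizing Gaussian samples."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


rng = np.random.default_rng(1)
dim = 64  # hypothetical hidden size

# Stand-in probe direction: the first coordinate axis, along which the
# two synthetic classes are separated by construction.
probe = np.zeros(dim)
probe[0] = 1.0
pos = rng.normal(size=(100, dim)) + 2.0 * probe
neg = rng.normal(size=(100, dim)) - 2.0 * probe

probe_auroc = auroc(pos @ probe, neg @ probe)

# Baseline: AUROC of the same data projected onto 100 random directions.
baseline = np.array(
    [auroc(pos @ v, neg @ v) for v in random_unit_vectors(100, dim, rng)]
)
p95 = float(np.percentile(baseline, 95))
z = float((probe_auroc - baseline.mean()) / baseline.std())

print(f"probe AUROC {probe_auroc:.3f} | baseline 95th pct {p95:.3f} | z = {z:.2f}")
```

A probe that clears the 95th percentile of random directions is unlikely to owe its separation to chance alignment, which is the argument the caption makes for the three empathy probes.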