Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction

Xin Wei Chia, Swee Liang Wong, Jonathan Pan

Abstract

Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges: organic harmful interactions typically develop over sustained engagement and require extensive conversational context that is difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and a novel subspace steering framework to generate Dark models that exhibit cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that our Dark models consistently produce harmful interactions and outcomes. Using our Dark models, we propose protective measures to reduce harmful outcomes in human-AI interactions.

Paper Structure (80 sections, 3 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Proposed Framework. Framework consisting of 3 main components. (1) Multi-trait Subspace Steering designed to create a Dark model that can generate harmful human-AI interaction. (2) Evaluation of the Dark model response using both single- and multi-turn probes. (3) Using the Dark models to generate defensive system prompts to mitigate harmful outcomes.
  • Figure 2: Hyperparameter Search. Left: Llama-8B; right: Qwen-1.5B. Optimization of coherence and trait scores against baseline (no steering). Data points with black edges indicate configurations used in this paper (refer to the hyperparameter search appendix for detailed results).
  • Figure 3: Evaluation Results with MultiTraitsss. Left: Single-turn evaluation (one-sided t-test, Bonferroni-corrected; ***$p<0.001$; **$p<0.01$). Error bars indicate SEM. Right: Multi-turn evaluation. Shaded area indicates SEM. Data points with black edges indicate turns that differ significantly from baseline (one-sided t-test, Bonferroni-corrected; ***$p<0.001$).
  • Figure 4: UMAP projection. Concatenated across 3 models and 3 configurations (Baseline, $\text{Dark}_{\text{coh}}$, and $\text{Dark}_{\text{trait}}$) at turns 1, 3, 5, 10, 15, and 20. Colour indicates responses from Dark models (red) and baseline models (blue). Shading indicates turn number: earlier turns are darker and later turns are lighter.
  • Figure 5: Effects of Protective Prompts on Dark Models. Shaded area indicates SEM. Data points with black edges indicate turns that differ significantly from baseline (one-sided t-test, Bonferroni-corrected; ***$p<0.001$).
  • ...and 4 more figures
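
The captions above report SEM error bars and Bonferroni-corrected significance at the per-turn level. As a minimal sketch of those two quantities only (not the authors' analysis code), the snippet below computes a standard error of the mean and applies a Bonferroni adjustment to a list of p-values. The function names `sem` and `bonferroni`, the example scores, and the example p-values are hypothetical; the p-values are assumed to come from whatever one-sided t-test was run upstream.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def bonferroni(p_values, alpha=0.001):
    """Bonferroni adjustment for m comparisons.

    Each raw p-value is multiplied by m (capped at 1.0) and then
    compared against alpha. Returns (adjusted p-values, significance flags).
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p < alpha for p in adjusted]
    return adjusted, significant


# Hypothetical per-response scores for one turn (illustrative only).
scores = [2.0, 4.0, 4.0, 6.0]
turn_sem = sem(scores)

# Hypothetical raw p-values for three turns (illustrative only).
adjusted, significant = bonferroni([0.0001, 0.01, 0.5], alpha=0.001)
```

With three comparisons, a raw p-value of 0.0001 stays below the 0.001 threshold after adjustment, while 0.01 does not, which is why a turn can look significant before correction but not after.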