HEART: Emotionally-driven test-time scaling of Language Models

Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Tomas Pfister, Hamid Palangi

TL;DR

HEART proposes emotionally driven, iterative self-correction for language models by embedding Affective Cue Prompts derived from Ekman’s six emotions into a multi-step refinement loop. The framework leverages two resolution modes—an oracle verifier (S1) and a verifier-free generator (S2)—to assess both upper-bound potential and real-world viability, reporting substantial gains under S1 across OlympiadBench, Humanity’s Last Exam, and SimpleQA. Ablation studies show that dynamic, alternating affective cues are crucial for performance, while verifier-free results reveal a practical bottleneck in autonomous selection. Overall, HEART advances a new frontier in machine reasoning by uniting structured reasoning with affect-informed motivation, though it requires careful handling of ethical considerations and further work on adaptive cue selection and multimodal extension.

Abstract

Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection primarily focus on logical or structural refinement and do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Our results reveal a significant new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks significantly deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines with the same verifier. However, we also identify a critical bottleneck for practical deployment: in a verifier-free setting, HEART struggles to harness these gains consistently, which we highlight as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.
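The oracle-guided loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `heart_refine`, the prompt template, and the cue phrases in `HYPOTHETICAL_CUES` are invented for the example (the paper's curated phrases are not reproduced here), and `generate` and `is_correct` stand in for a model call and an oracle verifier.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Ekman's six basic emotions. The cue wording is a placeholder written for
# this sketch; it is NOT the curated phrase set used in the paper.
HYPOTHETICAL_CUES: Dict[str, str] = {
    "anger": "This answer is frustratingly wrong.",
    "disgust": "This reasoning is unpleasant to read.",
    "fear": "I am worried this mistake will cause problems.",
    "happiness": "I am glad you tried; I know you can get this right.",
    "sadness": "It is disappointing that this answer is incorrect.",
    "surprise": "I did not expect this answer to be wrong.",
}


def heart_refine(
    task: str,
    generate: Callable[[str], str],      # assumed model call: prompt -> response
    is_correct: Callable[[str], bool],   # assumed oracle verifier
    emotion_sequence: Sequence[str],     # one emotion per refinement iteration
    cues: Dict[str, str] = HYPOTHETICAL_CUES,
) -> Tuple[str, List[str]]:
    """Retry a task with emotionally toned feedback until the oracle accepts."""
    response = generate(task)
    history = [response]
    for emotion in emotion_sequence:
        if is_correct(response):
            break  # the oracle accepted the latest response; stop iterating
        # Combine the original task, the rejected response, and an affective cue.
        prompt = (
            f"{task}\n\nPrevious response:\n{response}\n\n"
            f"Feedback: {cues[emotion]}\nPlease try again."
        )
        response = generate(prompt)
        history.append(response)
    return response, history
```

Passing a sequence such as `["anger", "happiness", "fear"]` varies the emotional tone from one iteration to the next, while repeating a single emotion gives a fixed-tone variant of the same loop.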

Paper Structure

This paper contains 30 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: An illustration of the HEART framework. The process begins when a task is sent to a large language model (LLM), which returns a response. An oracle then evaluates the response against the ground truth. If the response is incorrect, the HEART process begins, incorporating the original task, the LLM's response, and selected affective cue prompts to generate a new, improved response.
  • Figure 2: Final accuracy of Gemini 2.5 Flash under static and dynamic affective prompting strategies. Dynamic sequences involve prompts that change mid-task. Notations are defined in the appendix on pattern notations.
  • Figure 3: Performance (measured in cumulative accuracy) at each iteration t with "Wait" (blue), CoT (yellow), Self Reflection (green) and HEART (pink) on HLE with Gemini 2.5 Flash.
  • Figure 4: The 10 Best Performing Emotion Patterns using HEART on OlympiadBench Math with Gemini 2.5 Flash.
  • Figure 5: Gemini 2.5 Flash Accuracy per Iteration on OlympiadBench Physics Open Ended Problems using HEART.
  • ...and 1 more figure
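Figure 3 reports cumulative accuracy at each iteration t. One plausible reading, assumed here rather than taken from the paper, is the fraction of problems answered correctly at or before iteration t. The helper below is a hypothetical sketch of that computation; the function name and the input format (the first iteration at which each problem was solved, or `None` if it never was) are choices made for the example.

```python
from typing import List, Optional, Sequence


def cumulative_accuracy(
    first_correct_iter: Sequence[Optional[int]],
    num_iterations: int,
) -> List[float]:
    """Fraction of problems solved at or before each iteration 0..num_iterations.

    first_correct_iter[i] is the first iteration at which problem i was
    answered correctly (0 = initial response), or None if never solved.
    """
    total = len(first_correct_iter)
    if total == 0:
        raise ValueError("need at least one problem")
    curve = []
    for t in range(num_iterations + 1):
        # Count problems whose first correct answer arrived by iteration t.
        solved = sum(1 for it in first_correct_iter if it is not None and it <= t)
        curve.append(solved / total)
    return curve


# Four problems: solved at iterations 0, 2, never, and 1.
print(cumulative_accuracy([0, 2, None, 1], num_iterations=3))
# prints [0.25, 0.5, 0.75, 0.75]
```

Under this definition the curve never decreases, because a problem counts as solved from its first correct iteration onward.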