HEART: Emotionally-driven test-time scaling of Language Models
Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Tomas Pfister, Hamid Palangi
TL;DR
HEART proposes emotionally driven, iterative self-correction for language models by embedding Affective Cue Prompts derived from Ekman’s six emotions into a multi-step refinement loop. The framework uses two resolution modes, an oracle verifier (S1) and a verifier-free generator (S2), to assess both upper-bound potential and real-world viability, and reports substantial gains under S1 on OlympiadBench, Humanity’s Last Exam, and SimpleQA. Ablation studies show that dynamic, alternating affective cues are crucial for performance, while verifier-free results reveal a practical bottleneck in autonomous selection. Overall, HEART pairs structured reasoning with affect-informed motivation, though it requires careful handling of ethical considerations and further work on adaptive cue selection and multimodal extension.
Abstract
Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection focus primarily on logical or structural refinement and do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Our results reveal a new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines that use the same verifier. However, we also identify a critical bottleneck for practical deployment: in a verifier-free setting, the method struggles to harness these gains consistently, and we highlight this as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.
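The oracle-guided loop described in the abstract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the cue wording in `HYPOTHETICAL_CUES` is invented (only the six emotion categories come from the text), the fixed round-robin order stands in for whatever cue schedule the paper uses, and `generate` and `is_correct` are hypothetical callables standing in for a language model and an oracle verifier.

```python
from itertools import cycle
from typing import Callable, List, Tuple

# Ekman's six basic emotions. The phrases are invented placeholders,
# not the curated cue set used in the paper.
HYPOTHETICAL_CUES = {
    "happiness": "You are close, and I am excited to see you get this right!",
    "sadness": "I am disappointed that this answer is wrong.",
    "fear": "I am worried about what happens if this stays wrong.",
    "anger": "This mistake is frustrating. Fix it.",
    "surprise": "Wow, I did not expect that error from you.",
    "disgust": "This reasoning is sloppy and hard to accept.",
}


def affective_refine(
    question: str,
    generate: Callable[[str], str],     # hypothetical model call: prompt -> answer
    is_correct: Callable[[str], bool],  # hypothetical oracle verifier
    max_iters: int = 6,
) -> Tuple[str, List[str]]:
    """Retry with a different emotional cue each round until the oracle accepts."""
    answer = generate(question)
    history = [answer]
    emotions = cycle(HYPOTHETICAL_CUES)  # assumed schedule: simple round-robin
    for _ in range(max_iters):
        if is_correct(answer):
            break
        emotion = next(emotions)
        prompt = (
            f"{question}\n\n"
            f"Previous answer: {answer}\n"
            f"Feedback: {HYPOTHETICAL_CUES[emotion]}\n"
            "Please reconsider and answer again."
        )
        answer = generate(prompt)
        history.append(answer)
    return answer, history
```

In the verifier-free setting the `is_correct` oracle is unavailable, so the loop has no trusted stopping signal and the final answer must be picked from `history` by other means; that selection step is where the abstract locates the practical bottleneck.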
