Echo-N1: Affective RL Frontier

Naifan Zhang, Ruihan Sun, Ruixi Su, Shiqi Ma, Shiya Zhang, Xianna Weng, Xiaofan Zhang, Yuhan Zhan, Yuyang Xu, Zhaohan Chen, Zhengyuan Pan, Ziyi Song

TL;DR

This work demonstrates that reinforcement learning can effectively optimize large language models for subjective, emotion-driven conversation by introducing dual reward models—Empathy Reward and Humanlike Reward—and a comprehensive evaluation suite combining quantitative EPM metrics with qualitative NEE judgments. Echo-N1 embodies these ideas, achieving substantial improvements in empathy, emotional coherence, and humanlikeness compared with base models and open-source baselines. The authors also develop an end-to-end data pipeline (SFT and reward-model training) and a dynamic evaluation framework to measure performance across static and dynamic dimensions, establishing a practical pathway for personalizing AI companions. The results indicate that carefully designed reward structures and evaluation protocols can stabilize RL in non-verifiable, human-centered tasks and unlock the potential of affective reinforcement learning as a core component of future dialogue systems.

Abstract

The LLM field has spent a year perfecting RL for tasks machines already excel at (math, code, and deterministic reasoning) while completely sidestepping the domain that actually defines human intelligence: subjective, emotionally grounded, personality-sensitive conversation. This space has often been regarded as inherently subjective and challenging to formalize, making it appear unsuitable for conventional RL pipelines. We show that applying RL here is not only possible but that this is a solvable and transformative RL problem. We propose the first framework that infers user personality on the fly and optimizes model behavior toward personalized conversational preferences. Contrary to the widespread belief that RL collapses in non-verifiable settings, our method produces consistent, robust, and dramatic improvements in humanlike interaction quality. We also introduce the first dynamic emotional intelligence evaluation suite to quantify these gains. Our model, Echo-N1, performs far above its base version and outperforms the proprietary Doubao 1.5 Character. This work establishes a new frontier for RL: optimizing models for the deeply subjective, deeply human dimensions of conversation.

Paper Structure

This paper contains 44 sections, 9 equations, 20 figures, 10 tables.

Figures (20)

  • Figure 1: Comprehensive Adaptability Analysis of EPM-Q: (a) EPM-Q Adaptability Analysis: Mechanism Stress Test (Routine vs. Challenging Scenarios); (b) EPM-Q Adaptability Analysis: Persona Resilience (Need Type & Empathy Threshold); (c) EPM-Q Adaptability Analysis: Scenario Categories (Performance across Different Domains).
  • Figure 2: Overview of our character–user interaction pipeline. We first construct AI character profiles by extracting concise descriptors from books, films, Wikipedia, or LLM-generated summaries of classic IP characters. These descriptors are then expanded into a full character system prompt (SP) using our AI-Character Meta-SP generator. On the user side, a lightweight LLM is used to produce an initial profile (e.g., gender, speech style, age, MBTI), which is subsequently enriched into a detailed user SP via our User Meta-SP generator. During interaction, the AI character SP and the user SP are fed into two separate dialogue models to produce responses. At a high level, a director agent is invoked every five turns to regulate the conversational flow: based on the dialogue history and both SPs, it decides whether to maintain the current topic or initiate a new one.
  • Figure 3: The overall pipeline for constructing reward model training data. The process begins by filtering human-annotated dialogues to isolate contextually relevant, empathy-requiring scenarios while excluding unsafe content. Subsequently, a principle-driven Critique-Rewrite framework analyzes user personas to generate graded responses and iteratively refines suboptimal outputs into golden versions. These high-quality responses are finally paired with lower-scoring candidates to construct the preference dataset.
  • Figure 4: Illustration of the SFT dataset composition. The dataset integrates our proprietary AI companionship data with several open-source datasets, balancing domain specialization and general coverage.
  • Figure 5: Comprehensive AI companionship evaluation pipeline
  • ...and 15 more figures