
Rhea: Role-aware Heuristic Episodic Attention for Conversational LLMs

Wanyang Hong, Zhaoning Zhang, Yi Chen, Libo Zhang, Baihui Liu, Linbo Qiao, Zhiliang Tian, Dongsheng Li

TL;DR

Multi-turn LLMs suffer from cumulative contextual decay due to attention pollution, dilution, and drift. Rhea introduces a role-aware memory architecture with an Instructional Memory for persistent global constraints and an Episodic Memory for dynamic interactions, coupled with a heuristic context retrieval mechanism and embedding-level reconstruction to maintain high signal-to-noise context. Empirical results across MT-Bench, MT-Eval, and Long-MT-Bench+ show substantial improvements in long-horizon accuracy and instruction fidelity, including a 16% relative gain and IAR > 8.1, with only modest latency overhead. Ablation studies demonstrate the necessity of both memory streams and the retrieval strategy, underscoring a shift from expanding context windows to improving the quality and structure of context for robust conversational LLMs.

Abstract

Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We define this phenomenon as cumulative contextual decay - a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. To address this challenge, we propose Rhea (Role-aware Heuristic Episodic Attention), a novel framework that decouples conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints via a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval. During inference, Rhea constructs a high signal-to-noise context by applying its priority attention: selectively integrating relevant episodic information while always prioritizing global instructions. To validate this approach, experiments on multiple multi-turn conversation benchmarks - including MT-Eval and Long-MT-Bench+ - show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (a 16% relative gain over strong baselines). Moreover, Rhea maintains near-perfect instruction fidelity (IAR > 8.1) across long-horizon interactions. These results demonstrate that Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs.
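
The decoupling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `RoleAwareMemory`, its method names, the `[INSTRUCTION]`/`[EPISODE]`/`[QUERY]` tags, and the word-overlap score are all hypothetical stand-ins chosen for illustration. In particular, word overlap only approximates the idea of heuristic context retrieval, the sketch assumes an external recognizer has already decided whether a turn is a global instruction, and it omits latent compression entirely.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word set; a crude relevance signal (assumption, not the paper's scorer)."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class RoleAwareMemory:
    # Instructional Memory (IM): global constraints, always kept verbatim.
    instructions: list[str] = field(default_factory=list)
    # Episodic Memory (EM): (user message, model reply) pairs, retrieved selectively.
    episodes: list[tuple[str, str]] = field(default_factory=list)

    def add_turn(self, user_msg: str, reply: str, is_instruction: bool) -> None:
        """Route a turn to IM or EM; `is_instruction` is assumed to come from a recognizer."""
        if is_instruction:
            self.instructions.append(user_msg)
        else:
            self.episodes.append((user_msg, reply))

    def build_context(self, query: str, k: int = 2) -> list[str]:
        """Instructions first, then up to k relevant episodes in chronological order."""
        q = _tokens(query)
        scored = [
            (len(q & _tokens(user + " " + reply)), i)
            for i, (user, reply) in enumerate(self.episodes)
        ]
        # Highest overlap first; ties broken by earlier turn index.
        top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
        # Drop episodes with no overlap and restore chronological order.
        keep = sorted(i for score, i in top if score > 0)

        context = [f"[INSTRUCTION] {text}" for text in self.instructions]
        context += [
            f"[EPISODE] User: {self.episodes[i][0]} | Assistant: {self.episodes[i][1]}"
            for i in keep
        ]
        context.append(f"[QUERY] {query}")
        return context


if __name__ == "__main__":
    mem = RoleAwareMemory()
    mem.add_turn("Always answer in French.", "D'accord.", is_instruction=True)
    mem.add_turn("What is the capital of Japan?", "Tokyo.", is_instruction=False)
    mem.add_turn("Recommend a pasta recipe.", "Try carbonara.", is_instruction=False)
    for line in mem.build_context("Which country has Tokyo as its capital?", k=1):
        print(line)
```

The point of the design is that the instruction is emitted on every call regardless of how old it is, while unrelated episodes (the pasta turn here) are left out of the context instead of diluting it.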

Paper Structure

This paper contains 46 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An example of cumulative contextual decay. The model correctly adheres to the global instruction in Turn 1. However, as the conversation continues, it fails to maintain this instruction in Turns 2 and 3, driven by an interaction of attention drift, pollution, and dilution.
  • Figure 2: Overview of the Rhea Framework. (Left) The core architecture illustrating the decoupling of conversation history into Instructional Memory (IM) and Episodic Memory (EM) via instruction recognition and compression, followed by Heuristic Context Retrieval (HCR). (Right) A comparison of inference pipelines, demonstrating Rhea's hybrid context construction (combining instructions, text, and embeddings) in contrast to standard and naive compression baselines.
  • Figure 3: Implementation of Episodic Memory via Latent Compression. The framework utilizes a dual-LoRA architecture sharing a single LLM backbone. The Compression Module ($LoRA_{compress}$) encodes history into compact latent embeddings ($V_k$), while the Generation Module ($LoRA_{generate}$) processes the hybrid context to generate the reply.
  • Figure 4: Performance analysis of how Rhea mitigates cumulative contextual decay. (Left) Instruction Adherence Rate (IAR) over turns, showing Rhea's resistance to Attention Drift on specific constraints. (Right) Response quality on Long-MT-Bench+ (60+ turns), demonstrating Rhea's robustness against Attention Dilution in extended interactions compared to the degrading Vanilla baseline.
  • Figure 5: Effect of recognizer error types on IAR performance. The Real-World performance (orange) closely tracks the Oracle (gray), demonstrating the framework's high robustness to false-positive (FP) errors. In contrast, false-negative (FN) errors (blue) are catastrophic.
  • ...and 2 more figures