DMA: Online RAG Alignment with Human Feedback

Yu Bai, Yukai Miao, Dawei Wang, Li Chen, Fei Long, Rundi Zhai, Dan Li, Yanyu Ren, Tianfeng Liu, Hongtao Xie, Ce Yang, Xuhui Cai

TL;DR

DMA addresses the non-stationary nature of real-world RAG by turning multi-granularity human feedback into a continuous control loop over retrieval. It formalizes a three-tier supervision pipeline (document-level, list-level, and response-level), plus nearline online updates and distillation into a low-latency scorer for serving. The approach yields substantial online gains in user satisfaction and competitive offline performance on conversational QA benchmarks, demonstrating effective real-time adaptation without sacrificing baseline retrieval capability. DMA thereby reframes alignment for RAG as memory control, enabling continual, human-guided improvements in interactive AI systems.

Abstract

Retrieval-augmented generation (RAG) systems often rely on static retrieval, limiting adaptation to evolving intent and content drift. We introduce Dynamic Memory Alignment (DMA), an online learning framework that systematically incorporates multi-granularity human feedback to align ranking in interactive settings. DMA organizes document-, list-, and response-level signals into a coherent learning pipeline: supervised training for pointwise and listwise rankers, policy optimization driven by response-level preferences, and knowledge distillation into a lightweight scorer for low-latency serving. Throughout this paper, memory refers to the model's working memory, which is the entire context visible to the LLM for In-Context Learning. We adopt a dual-track evaluation protocol mirroring deployment: (i) large-scale online A/B ablations to isolate the utility of each feedback source, and (ii) few-shot offline tests on knowledge-intensive benchmarks. Online, a multi-month industrial deployment further shows substantial improvements in human engagement. Offline, DMA preserves competitive foundational retrieval while yielding notable gains on conversational QA (TriviaQA, HotpotQA). Taken together, these results position DMA as a principled approach to feedback-driven, real-time adaptation in RAG without sacrificing baseline capability.

Paper Structure

This paper contains 52 sections, 14 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: DMA overview. Multi-level human feedback is captured, modeled, and fused to guide online retrieval/reranking. Reranker alignment and distillation for serving are detailed in Figure 2.
  • Figure 2: Training-to-serving pathway in DMA. Document-, list-, and response-level feedback supervise retrieval-side teachers. The listwise policy is aligned with the reward via PPO under a Plackett–Luce policy. All teacher logits are then distilled into a compact GBDT scorer for sub-10 ms online reranking, independent of the LLM generator.
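The Figure 2 caption mentions a Plackett–Luce policy over rankings. As a rough illustration of what such a policy computes, the sketch below implements the standard Plackett–Luce distribution: a ranking is built position by position, each time drawing one of the remaining documents with probability proportional to the exponential of its score. This is not the paper's implementation; the function names `plackett_luce_log_prob` and `sample_ranking` and the toy scores are assumptions made for illustration only.

```python
import math
import random


def plackett_luce_log_prob(scores, ranking):
    """Log-probability of `ranking` under a Plackett-Luce model.

    `scores` holds one real-valued logit per document (hypothetical values);
    `ranking` lists document indices from best to worst.
    """
    log_p = 0.0
    for pos, item in enumerate(ranking):
        # Logits of the documents not yet placed (this one and all later ones).
        tail = [scores[j] for j in ranking[pos:]]
        # Numerically stable log-sum-exp over the remaining documents.
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        log_p += scores[item] - log_z
    return log_p


def sample_ranking(scores, rng):
    """Sample a full ranking by repeated softmax draws without replacement."""
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        m = max(scores[j] for j in remaining)
        weights = [math.exp(scores[j] - m) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


if __name__ == "__main__":
    toy_scores = [2.0, 0.5, -1.0]  # illustrative logits for three documents
    ranking = sample_ranking(toy_scores, random.Random(0))
    print(ranking, plackett_luce_log_prob(toy_scores, ranking))
```

A policy-gradient method such as PPO needs exactly this kind of log-probability to form the likelihood ratio between the updated and the previous policy for a sampled ranking. How DMA parameterizes the scores and shapes the reward is not specified in this summary.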