DMA: Online RAG Alignment with Human Feedback

Yu Bai; Yukai Miao; Dawei Wang; Li Chen; Fei Long; Rundi Zhai; Dan Li; Yanyu Ren; Tianfeng Liu; Hongtao Xie; Ce Yang; Xuhui Cai

TL;DR

DMA addresses the non-stationary nature of real-world RAG by turning multi-granularity human feedback into a continuous control loop over retrieval. It formalizes a three-tier supervision pipeline (document-level, list-level, and response-level), adds nearline online updates, and distills the result into a low-latency scorer for serving. The approach yields substantial online gains in user satisfaction and competitive offline performance on conversational QA benchmarks, demonstrating real-time adaptation without sacrificing baseline retrieval capability. DMA thereby reframes alignment for RAG as memory control, enabling continual, human-guided improvement of interactive AI systems in practical deployments.

Abstract

Retrieval-augmented generation (RAG) systems often rely on static retrieval, which limits adaptation to evolving user intent and content drift. We introduce Dynamic Memory Alignment (DMA), an online learning framework that systematically incorporates multi-granularity human feedback to align ranking in interactive settings. DMA organizes document-, list-, and response-level signals into a coherent learning pipeline: supervised training for pointwise and listwise rankers, policy optimization driven by response-level preferences, and knowledge distillation into a lightweight scorer for low-latency serving. Throughout this paper, "memory" refers to the model's working memory, that is, the entire context visible to the LLM for in-context learning. We adopt a dual-track evaluation protocol mirroring deployment: (i) large-scale online A/B ablations to isolate the utility of each feedback source, and (ii) few-shot offline tests on knowledge-intensive benchmarks. Online, a multi-month industrial deployment shows substantial improvements in human engagement. Offline, DMA preserves competitive foundational retrieval while yielding notable gains on conversational QA (TriviaQA, HotpotQA). Taken together, these results position DMA as a principled approach to feedback-driven, real-time adaptation in RAG without sacrificing baseline capability.
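
The distillation stage mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, the student architecture, or the features, so everything below is an assumption made for illustration: a linear student scorer is fit to a teacher's scores over one candidate list by minimizing a listwise softmax cross-entropy, and all function names (`distill`, `student_scores`, and so on) are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one candidate list."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def student_scores(weights, features):
    """Hypothetical lightweight scorer: a dot product per document."""
    return [sum(w * x for w, x in zip(weights, row)) for row in features]


def distill_loss(weights, features, teacher_scores):
    """Cross-entropy between teacher and student list distributions."""
    p = softmax(teacher_scores)
    q = softmax(student_scores(weights, features))
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def distill_step(weights, features, teacher_scores, lr=0.5):
    """One gradient-descent step on the listwise distillation loss."""
    p = softmax(teacher_scores)
    q = softmax(student_scores(weights, features))
    # The gradient of the loss w.r.t. student score i is (q_i - p_i);
    # the chain rule through the linear scorer multiplies by the features.
    grad = [
        sum((qi - pi) * row[j] for pi, qi, row in zip(p, q, features))
        for j in range(len(weights))
    ]
    return [w - lr * g for w, g in zip(weights, grad)]


def distill(features, teacher_scores, steps=200, lr=0.5):
    """Fit the student to one list of teacher scores, starting from zeros."""
    weights = [0.0] * len(features[0])
    for _ in range(steps):
        weights = distill_step(weights, features, teacher_scores, lr)
    return weights


if __name__ == "__main__":
    # Toy list of three candidate documents with two assumed features each.
    features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    teacher = [2.0, 0.0, 1.0]  # scores from an assumed heavier ranker
    weights = distill(features, teacher)
    print("student weights:", weights)
    print("student scores:", student_scores(weights, features))
```

At serving time only `student_scores` would run, which is the point of distilling into a lightweight scorer: the heavier teacher is needed only when the student is refreshed.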
