Improve Large Language Model Systems with User Logs

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

TL;DR

UNO addresses the challenge of evolving LLM systems from real user logs, tackling the scarcity and noisiness of such data. It distills user interactions into semi-structured rules, clusters by query and rule context, and uses cognitive-gap assessment to decide between a Primary Experience Module (Expert LoRA) and a Reflective Experience Module (Critic LoRA) for inference-time refinement, while avoiding direct updates to the base model. The framework also employs a simulated performance verifier to guard against noise and ensures robust off-policy learning through module-level adaptation. Experiments on MemoryBench show state-of-the-art effectiveness and efficiency over RAG and memory-based baselines, with ablations highlighting the critical roles of clustering, dual-path adaptation, and filtering in safely leveraging user logs. The work advances practical, lifelong learning for deployed LLM systems and provides open-source code for reproducibility and further development, enabling scalable, data-efficient improvement in real-world deployments.

Abstract

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging because they are unstructured and noisy. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval-Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

Paper Structure

This paper contains 26 sections, 2 theorems, 4 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.2

For data with small cognitive gaps, the posterior probability of noise $P(N|g_i \leq \tau)$ is strictly bounded:

Figures (3)

  • Figure 1: The workflow of UNO. UNO first distills and filters raw user logs, then performs clustering and a cognitive gap assessment to select the type of experience module (primary or reflective). At inference time, UNO identifies the appropriate cluster and applies an inference strategy aligned with the type of that cluster.
  • Figure 2: Comparison of extra input tokens versus performance (Norm-Score). Extra tokens are computed using the Qwen3-8B tokenizer. Better performance lies toward the upper-left. We report UNO-Single because full UNO may trigger an additional critique-and-revise step (multiple LLM calls and extra output tokens), which is not captured by the "extra input tokens" metric.
  • Figure 3: Results of online evolution settings on the phi-4 model. "Offline" represents the setting in the main experiments.
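
The workflow in the Figure 1 caption (assess each cluster's cognitive gap, pick a module type, then route a query to its cluster at inference time) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Cluster` class, the toy two-dimensional centroids, the mean-gap rule, the threshold `tau`, and the direction of the rule (small gap selects the primary module) are all assumptions made for the example. The critique-and-revise plan for reflective clusters follows the extra step mentioned in the Figure 2 caption.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Cluster:
    name: str
    centroid: list[float]  # toy embedding of the cluster's queries (assumed)
    gaps: list[float]      # hypothetical per-example cognitive-gap scores
    module: str = ""       # set by assign_modules


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_modules(clusters: list[Cluster], tau: float) -> None:
    """Assumed rule: mean gap <= tau selects 'primary', otherwise 'reflective'."""
    for c in clusters:
        c.module = "primary" if mean(c.gaps) <= tau else "reflective"


def route(query_vec: list[float], clusters: list[Cluster]) -> Cluster:
    """Pick the cluster whose centroid is most similar to the query."""
    return max(clusters, key=lambda c: cosine(query_vec, c.centroid))


def inference_plan(query_vec: list[float], clusters: list[Cluster]) -> list[str]:
    """Return the sequence of steps implied by the routed cluster's module type."""
    c = route(query_vec, clusters)
    if c.module == "primary":
        return [f"generate with primary module of '{c.name}'"]
    return [
        "generate draft",
        f"critique draft with reflective module of '{c.name}'",
        "revise draft",
    ]


clusters = [
    Cluster("formatting", centroid=[1.0, 0.0], gaps=[0.1, 0.2]),
    Cluster("domain-facts", centroid=[0.0, 1.0], gaps=[0.8, 0.9]),
]
assign_modules(clusters, tau=0.5)
for step in inference_plan([0.1, 0.9], clusters):
    print(step)
```

In this sketch a query near the "domain-facts" centroid yields a three-step plan, while a query near "formatting" is answered in a single call; this mirrors why the Figure 2 caption reports UNO-Single separately from full UNO.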

Theorems & Definitions (2)

  • Theorem 3.2: Noise Risk Bound
  • Theorem 3.4: Variance Reduction via Clustering
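
The statement of Theorem 3.4 is not reproduced here, so the following is only a generic illustration of why grouping heterogeneous data can lower variance. It uses the standard law of total variance: the pooled variance equals the size-weighted within-cluster variance plus the between-cluster variance. The toy scalar "feedback signals" and the helper names are hypothetical.

```python
from statistics import fmean, pvariance


def within_cluster_variance(clusters: list[list[float]]) -> float:
    """Size-weighted average of each cluster's population variance."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) * pvariance(c) for c in clusters) / n


def between_cluster_variance(clusters: list[list[float]]) -> float:
    """Size-weighted variance of the cluster means around the grand mean."""
    n = sum(len(c) for c in clusters)
    grand = fmean([x for c in clusters for x in c])
    return sum(len(c) * (fmean(c) - grand) ** 2 for c in clusters) / n


# Two toy clusters whose signals point in opposite directions.
toy_clusters = [[0.9, 1.0, 1.1], [-1.1, -1.0, -0.9]]
pooled = [x for c in toy_clusters for x in c]

total = pvariance(pooled)
within = within_cluster_variance(toy_clusters)
between = between_cluster_variance(toy_clusters)

print(f"total={total:.4f} within={within:.4f} between={between:.4f}")
```

Because the between-cluster term is never negative, the within-cluster variance cannot exceed the pooled variance; the more the clusters differ from one another, the larger the reduction from handling each cluster separately.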