AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu, Karthik Subbian, Danai Koutra
TL;DR
AgentDR tackles hallucination and scalability in LLM-driven recommendation by delegating full-ranking tasks to traditional recommendation tools and using LLMs to infer user intent through implicit substitute and complement relationships between items. The framework combines a user profile and two memory modules with substitute/complement generation, personalized tool selection, and a Dual S&C reranking pipeline, and it achieves superior full-ranking performance on three grocery datasets. The work also introduces Vicinity-DCG (VDCG), a metric that jointly assesses semantic alignment and ranking correctness and highlights the value of relational reasoning in matching user intent. In practice, AgentDR delivers scalable, semantically aware recommendations with strong gains over its baselines, and it can flexibly fuse multiple ranking signals for large catalogs.
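To make the idea of a metric that rewards semantic near misses more concrete, here is one way a DCG-style score could give partial credit to recommended items that are close to the ground-truth item (for example, a substitute or complement). This is an illustrative sketch under our own assumptions, not the paper's VDCG definition: the function name `vicinity_dcg`, the clamped similarity gain, and the `similarity` callable (a hypothetical stand-in for an LLM judge) are all assumptions, and the score is left unnormalized.

```python
import math
from typing import Callable, Sequence


def vicinity_dcg(
    ranked: Sequence[str],
    target: str,
    similarity: Callable[[str, str], float],
    k: int = 10,
) -> float:
    """Hypothetical DCG variant (NOT the paper's exact VDCG).

    An exact hit earns gain 1.0; any other item earns a partial gain given by
    `similarity`, a placeholder for an LLM judging semantic closeness.
    Gains are discounted by log2(rank + 1), as in standard DCG.
    """
    score = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            gain = 1.0
        else:
            # Clamp to [0, 1] so a near miss can never outscore an exact hit.
            gain = min(max(similarity(item, target), 0.0), 1.0)
        score += gain / math.log2(rank + 1)
    return score


def toy_similarity(a: str, b: str) -> float:
    # Hypothetical judge: two items that both mention "milk" count as substitutes.
    return 0.5 if "milk" in a and "milk" in b else 0.0


print(round(vicinity_dcg(["oat milk", "whole milk", "bread"], "whole milk", toy_similarity), 4))  # 1.1309
```

With a similarity function that always returns 0, the score reduces to the usual single-target DCG, 1 / log2(rank + 1) for a hit at `rank`; the partial gains are what let a semantically close but non-identical recommendation still earn credit.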
Abstract
Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they often hallucinate non-existent items and struggle to rank over the full item catalog. In addition, a largely underexplored opportunity lies in leveraging LLMs' commonsense reasoning to capture user intent through substitute and complement relationships between items, which are usually implicit in datasets and difficult for traditional ID-based recommenders to capture. In this work, we propose a novel LLM-agent framework, AgentDR, which bridges LLM reasoning with scalable recommendation tools. Our approach delegates full-ranking tasks to traditional models while using LLMs to (i) integrate multiple recommendation outputs based on personalized tool suitability and (ii) reason over substitute and complement relationships grounded in user history. This design mitigates hallucination, scales to large catalogs, and enhances recommendation relevance through relational reasoning. Through extensive experiments on three public grocery datasets, we show that our framework achieves superior full-ranking performance, yielding on average a twofold improvement over its underlying tools. We also introduce a new LLM-based evaluation metric that jointly measures semantic alignment and ranking correctness.
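The division of labor described in the abstract, where traditional tools produce full-catalog rankings and the LLM decides how much to trust each tool for a given user, can be illustrated with a small fusion step. The following is a minimal sketch that swaps in weighted reciprocal rank fusion (RRF) as the combination rule; the per-user `tool_weights` are hypothetical numbers standing in for the LLM's suitability judgment, the tool names are placeholders, and none of this is claimed to be AgentDR's actual integration procedure. Because every output item comes from a tool's ranking, the fused list cannot contain non-existent items.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    tool_rankings: Dict[str, List[str]],
    tool_weights: Dict[str, float],
    k: int = 60,
    top_n: int = 10,
) -> List[str]:
    """Weighted reciprocal rank fusion over per-tool ranked item lists.

    Each item receives sum_t w_t / (k + rank_t(item)) over the tools that rank it.
    `tool_weights` is a hypothetical per-user trust score for each tool;
    tools missing from it get weight 0.
    """
    scores: Dict[str, float] = defaultdict(float)
    for tool, ranking in tool_rankings.items():
        weight = tool_weights.get(tool, 0.0)
        for rank, item in enumerate(ranking, start=1):
            scores[item] += weight / (k + rank)
    # Sort by fused score (descending), breaking ties by item ID for determinism.
    fused = sorted(scores, key=lambda item: (-scores[item], item))
    return fused[:top_n]


# Placeholder tools and weights: this user is assumed to be better served by tool_b.
rankings = {"tool_a": ["milk", "eggs", "bread"], "tool_b": ["bread", "butter", "milk"]}
weights = {"tool_a": 0.3, "tool_b": 0.7}
print(weighted_rrf(rankings, weights))  # ['bread', 'milk', 'butter', 'eggs']
```

Because `tool_b` carries more weight for this user, its top item `bread` outranks `milk` even though `tool_a` ranked `milk` first. The constant `k = 60` is the value commonly used for RRF; a larger `k` flattens the difference between adjacent ranks.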
