AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents

Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li

Abstract

Recommender agents built on Large Language Models offer a promising paradigm for recommendation. However, existing recommender agents typically suffer from a disconnect between intermediate reasoning and final ranking feedback, and are unable to capture fine-grained preferences. To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under sparse implicit feedback. Our approach makes three key contributions. First, we design a suite of recommendation-specific tools integrated into a ReAct loop to support evidence-grounded reasoning. Second, we propose a theoretically unbiased List-Wise Group Relative Policy Optimization (list-wise GRPO) to maximize ranking utility, ensuring accurate credit assignment for complex tool-use trajectories. Third, we introduce Progressive Preference Refinement (PPR) to resolve fine-grained preference ambiguities. By mining hard negatives from ranking violations and applying bidirectional preference alignment, PPR minimizes a convex upper bound of pairwise ranking errors. Experiments on benchmarks confirm that AgenticRec significantly outperforms baselines, validating the necessity of unifying reasoning, tool use, and ranking optimization.

Paper Structure (37 sections, 2 theorems, 11 equations, 7 figures, 3 tables)

This paper contains 37 sections, 2 theorems, 11 equations, 7 figures, 3 tables.

Key Result

Proposition 1

The list-wise GRPO gradient estimator, which utilizes the list-wise ranking metric $R(r_K, y)$ (i.e., NDCG@K) as the reward and the group average ranking score as the baseline, provides an unbiased estimate of the gradient for the expected list-wise utility objective $J(\theta) = \mathbb{E}_{\tau \s
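
The reward and baseline named in this proposition can be illustrated with a minimal sketch. It assumes a single held-out positive item per user (a common implicit-feedback setup, not confirmed by the text above), and the function names `ndcg_at_k` and `group_relative_advantages` are hypothetical. It subtracts only the group mean; any further normalization the paper may apply is not shown.

```python
import math
from typing import List, Sequence


def ndcg_at_k(ranked_items: Sequence[str], target: str, k: int) -> float:
    """NDCG@K when exactly one item (the target) is relevant (assumption)."""
    for position, item in enumerate(ranked_items[:k]):
        if item == target:
            # With one relevant item the ideal DCG is 1, so NDCG equals the DCG.
            return 1.0 / math.log2(position + 2)
    return 0.0  # target missing from the top-K list


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the group-average reward (the baseline) from each reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# A hypothetical group of four sampled ranking lists for the same user.
group = [["a", "b", "c"], ["b", "a", "c"], ["c", "d", "e"], ["d", "c", "a"]]
target = "a"

rewards = [ndcg_at_k(ranking, target, k=3) for ranking in group]
advantages = group_relative_advantages(rewards)
```

Trajectories that rank the target higher than the group average receive a positive advantage, and the rest receive a negative one, so the advantages within a group sum to zero.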

Figures (7)

  • Figure 1: Compared to existing training-free recommender agents like RecMind and InteRecAgent (left), AgenticRec (right) learns outcome-driven tool use and evidence-grounded ranking policies from interaction logs under implicit-feedback list-wise rewards.
  • Figure 2: Overview of AgenticRec.
  • Figure 3: Training statistics on the Office dataset: (a) tool usage statistics, where the blue line denotes the average number of tool calls per trajectory and the orange line indicates the percentage of positively rewarded trajectories that have tool invocation; (b) recommendation performance measured by H@10 during training.
  • Figure 4: Effect of group size in list-wise GRPO.
  • Figure 5: Performance of AgenticRec with varying backbone sizes.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Proposition 1: Unbiasedness of List-wise GRPO Gradient
  • Proposition 2: Error Bound Minimization via Bidirectional Preference Reasoning
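
To make the phrase "convex upper bound of pairwise ranking errors" concrete, the sketch below compares the 0-1 pairwise error with a base-2 logistic surrogate. The choice of the logistic surrogate is an assumption made here for illustration; the text above does not say which bound Proposition 2 uses. The `margin` is taken to be the preferred item's score minus the hard negative's score, and both function names are hypothetical.

```python
import math


def pairwise_error(margin: float) -> float:
    """0-1 ranking error: 1 if the preferred item does not outscore the negative."""
    return 1.0 if margin <= 0 else 0.0


def logistic_surrogate(margin: float) -> float:
    """Base-2 logistic loss, a convex function of the margin (assumed surrogate)."""
    return math.log2(1.0 + math.exp(-margin))


margins = [-2.0, -0.5, 0.0, 0.5, 2.0]

# The surrogate is at least as large as the 0-1 error at every margin checked.
bound_holds = all(logistic_surrogate(m) >= pairwise_error(m) for m in margins)
```

Because the surrogate dominates the 0-1 error and is differentiable, lowering it by gradient descent also lowers a bound on the number of mis-ordered pairs, which is the general idea behind optimizing a convex surrogate.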