MODRL-TA: A Multi-Objective Deep Reinforcement Learning Framework for Traffic Allocation in E-Commerce Search

Peng Cheng; Huimu Wang; Jinyuan Zhao; Yihao Wang; Enqiang Xu; Yu Zhao; Zhuojian Xiao; Songlin Wang; Guoyu Tang; Lin Liu; Sulong Xu

TL;DR

This work formulates multi-objective traffic allocation in e-commerce search as a Markov decision process (MDP) and proposes MODRL-TA, a framework with three components: (1) multi-objective Q-learning (MOQ), a set of objective-specific Q-learning models that can be extended with new objectives; (2) a Decision Fusion Module (DFM) based on the cross-entropy method (CEM), which dynamically tunes the objective weights to maximize a joint value; and (3) a Progressive Data Augmentation (PDA) system that bootstraps training with offline data and gradually incorporates real online data to mitigate the cold-start problem and distribution shift. The method yields significant improvements over baselines in both offline benchmarks and online A/B tests, and full deployment shows notable gains in impressions, CTR, and CVR along with practical robustness to changing merchant objectives. Key contributions are a scalable ensemble of RL models for multiple goals, real-time weight fusion via CEM, and a data-augmented cold-start strategy that transitions from simulated to real data. Overall, MODRL-TA is a practical, deployable solution for dynamic, multi-objective traffic allocation in large-scale e-commerce search platforms.

Abstract

Traffic allocation is the process of redistributing natural traffic to products by adjusting their positions in the post-search phase, with the aims of fostering merchant growth, precisely meeting customer demands, and maximizing the interests of all parties on e-commerce platforms. Existing methods based on learning to rank neglect the long-term value of traffic allocation, whereas reinforcement learning approaches struggle to balance multiple objectives and to handle cold starts in real-world data environments. To address these issues, this paper proposes a multi-objective deep reinforcement learning framework consisting of multi-objective Q-learning (MOQ), a decision fusion algorithm (DFM) based on the cross-entropy method (CEM), and a progressive data augmentation system (PDA). Specifically, MOQ constructs an ensemble of RL models, each dedicated to one objective, such as click-through rate or conversion rate. Each model treats the position of an item as its action and estimates the long-term value of its own objective. DFM then dynamically adjusts the weights among objectives to maximize long-term value, which accounts for the way objective preferences shift over time in e-commerce scenarios. PDA initially trains MOQ on simulated data built from offline logs. As experiments progress, it strategically integrates real user interaction data, which ultimately replaces the simulated dataset, to alleviate distributional shift and the cold-start problem. Experimental results on real-world online e-commerce systems demonstrate significant improvements from MODRL-TA, and we have successfully deployed MODRL-TA on an e-commerce search platform.
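The weight-fusion step can be illustrated with a minimal sketch of the cross-entropy method. The abstract does not specify how DFM parameterizes the weights or scores a candidate, so everything concrete here is an assumption: the names `cem_fuse_weights`, `softmax`, and `joint_value` are hypothetical, the weights are sampled as Gaussian logits and mapped to the simplex with a softmax, and `joint_value` stands in for whatever joint long-term value the platform would measure for a chosen position.

```python
import numpy as np


def softmax(x):
    """Map logits to non-negative weights that sum to 1 along the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def cem_fuse_weights(q_values, joint_value, n_iters=20, pop_size=64,
                     elite_frac=0.25, seed=0):
    """Search for objective weights with the cross-entropy method.

    q_values: array of shape (n_objectives, n_actions) holding the
        per-objective Q estimates for one state (one row per model).
    joint_value: hypothetical callable, action index -> float score.
    Returns (weights, action) for the final sampling distribution mean.
    """
    rng = np.random.default_rng(seed)
    n_obj = q_values.shape[0]
    mean, std = np.zeros(n_obj), np.ones(n_obj)
    n_elite = max(1, int(pop_size * elite_frac))

    for _ in range(n_iters):
        # Sample candidate weight vectors as Gaussian logits (assumption).
        logits = rng.normal(mean, std, size=(pop_size, n_obj))
        weights = softmax(logits)
        # Each candidate picks the action with the highest weighted Q-value.
        actions = (weights @ q_values).argmax(axis=1)
        scores = np.array([joint_value(int(a)) for a in actions])
        # Refit the sampling distribution to the best-scoring candidates.
        elite = logits[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3  # floor keeps some exploration

    best_w = softmax(mean)
    return best_w, int((best_w @ q_values).argmax())


# Toy usage: two objectives, three candidate positions.
q = np.array([[1.0, 0.0, 0.2],   # e.g. a CTR model's Q estimates
              [0.0, 1.0, 0.2]])  # e.g. a CVR model's Q estimates
w, a = cem_fuse_weights(q, joint_value=lambda act: [0.0, 1.0, 0.0][act])
```

In this toy setup the joint score rewards only position 1, so the search shifts weight toward the second objective. A real system would replace `joint_value` with an estimate derived from logged or live feedback.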
