ShuffleGate: Scalable Feature Optimization for Recommender Systems via Batch-wise Sensitivity Learning
Yihong Huang, Chen Chu, Fan Zhang, Liping Wang, Fei Chen, Yu Lin, Ruiduan Li, Zhihao Li
TL;DR
ShuffleGate is a unified sensitivity-learning framework for feature selection (FS), dimension selection (DS), and embedding compression. It scores each component by how much model performance degrades when that component's information is shuffled batch-wise. A differentiable gating mechanism combined with batch-wise noise yields naturally polarized importance scores, closes the search-retrain gap through a what-you-see-is-what-you-get (WYSIWYG) property, and scales to massive parameter spaces with minimal overhead. Empirically, it outperforms state-of-the-art baselines on FS and DS tasks, prunes up to 99.9% of embedding parameters on Criteo while maintaining competitive AUC, and delivers substantial industrial gains, including a 91% increase in training throughput in a billion-scale video recommender. The work also offers practical deployment guidance, describing scenarios where ShuffleGate reduces inference cost and training IO bottlenecks without sacrificing predictive performance.
Abstract
Feature optimization, specifically Feature Selection (FS) and Dimension Selection (DS), is critical for the efficiency and generalization of large-scale recommender systems. While conceptually related, these tasks are typically tackled with isolated solutions that often suffer from ambiguous importance scores or prohibitive computational costs. In this paper, we propose ShuffleGate, a unified and interpretable mechanism that estimates component importance by measuring the model's sensitivity to information loss. Unlike conventional gating that learns relative weights, ShuffleGate introduces a batch-wise shuffling strategy to effectively erase information in an end-to-end differentiable manner. This paradigm shift yields naturally polarized importance distributions, bridging the long-standing "search-retrain gap" and distinguishing essential signals from noise without complex threshold tuning. ShuffleGate provides a unified solution across granularities. It achieves state-of-the-art performance on feature and dimension selection tasks. Furthermore, to demonstrate its extreme scalability and precision, we extend ShuffleGate to evaluate fine-grained embedding entries. Experiments show it can identify and prune 99.9% of redundant embedding parameters on the Criteo dataset while maintaining competitive AUC, verifying its robustness in massive search spaces. Finally, the method has been successfully deployed in a top-tier industrial video recommendation platform. By compressing the concatenated input dimension from over 10,000 to 1,000+, it achieved a 91% increase in training throughput while serving billions of daily requests without performance degradation.
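The abstract describes the mechanism only in words, so the following is a minimal sketch of batch-wise shuffle gating under stated assumptions, not the authors' implementation. It assumes a sigmoid gate per feature that interpolates between each value and the value of the same feature taken from a randomly chosen row of the same batch. The names `shuffle_gate`, `gate_logits`, and `rng` are hypothetical and do not come from the paper.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def shuffle_gate(batch, gate_logits, rng):
    """Mix each feature with a batch-wise shuffled copy of itself.

    batch:       list of rows, each a list of floats (one per feature).
    gate_logits: one logit per feature; sigmoid(logit) is the gate
                 (assumed parameterization, not taken from the paper).
    rng:         a random.Random instance used for the permutations.

    A gate near 1 keeps the original value; a gate near 0 replaces it
    with a value of the same feature from another row, which erases
    the row-level information while preserving the feature's marginal
    distribution within the batch.
    """
    n_rows = len(batch)
    out = [row[:] for row in batch]
    for j, logit in enumerate(gate_logits):
        g = sigmoid(logit)
        # Independent permutation of row indices for each feature.
        perm = list(range(n_rows))
        rng.shuffle(perm)
        for i in range(n_rows):
            out[i][j] = g * batch[i][j] + (1.0 - g) * batch[perm[i]][j]
    return out


# Usage: feature 0 is kept (gate close to 1), feature 1 is shuffled
# away (gate close to 0).
rng = random.Random(0)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
gated = shuffle_gate(batch, gate_logits=[30.0, -30.0], rng=rng)
```

In a trainable version the logits would be learned jointly with the model, so a feature whose shuffling hurts the loss is pushed toward a gate of 1 and an uninformative feature can drift toward 0. Because the shuffled values come from the same batch, the gated input stays on the same scale as the original regardless of the gate value.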
