ShuffleGate: Scalable Feature Optimization for Recommender Systems via Batch-wise Sensitivity Learning

Yihong Huang, Chen Chu, Fan Zhang, Liping Wang, Fei Chen, Yu Lin, Ruiduan Li, Zhihao Li

TL;DR

ShuffleGate is a unified sensitivity-learning framework that optimizes features across feature selection (FS), dimension selection (DS), and embedding compression by measuring how much model performance degrades when information is shuffled batch-wise. A differentiable gating mechanism combined with batch-wise noise yields naturally polarized importance scores, bridges the search–retrain gap through a WYSIWYG property (the AUC observed during gate learning correlates strongly with the AUC after pruning), and scales to massive parameter spaces with minimal overhead. Empirically, it outperforms state-of-the-art baselines on FS and DS tasks, prunes up to 99.9% of embedding parameters on Criteo while maintaining competitive AUC, and delivers a 91% increase in training throughput in an industrial video recommender serving billions of daily requests. The work also offers practical deployment guidance, describing scenarios where ShuffleGate reduces inference costs and training IO bottlenecks without sacrificing predictive performance.

Abstract

Feature optimization, specifically Feature Selection (FS) and Dimension Selection (DS), is critical for the efficiency and generalization of large-scale recommender systems. While conceptually related, these tasks are typically tackled with isolated solutions that often suffer from ambiguous importance scores or prohibitive computational costs. In this paper, we propose ShuffleGate, a unified and interpretable mechanism that estimates component importance by measuring the model's sensitivity to information loss. Unlike conventional gating that learns relative weights, ShuffleGate introduces a batch-wise shuffling strategy to effectively erase information in an end-to-end differentiable manner. This paradigm shift yields naturally polarized importance distributions, bridging the long-standing "search-retrain gap" and distinguishing essential signals from noise without complex threshold tuning. ShuffleGate provides a unified solution across granularities. It achieves state-of-the-art performance on feature and dimension selection tasks. Furthermore, to demonstrate its extreme scalability and precision, we extend ShuffleGate to evaluate fine-grained embedding entries. Experiments show it can identify and prune 99.9% of redundant embedding parameters on the Criteo dataset while maintaining competitive AUC, verifying its robustness in massive search spaces. Finally, the method has been successfully deployed in a top-tier industrial video recommendation platform. By compressing the concatenated input dimension from over 10,000 to 1,000+, it achieved a 91% increase in training throughput while serving billions of daily requests without performance degradation.
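
The batch-wise shuffling idea described above can be illustrated with a minimal forward-pass sketch. It assumes a gate $g_i \in [0, 1]$ per feature field that blends the original field with a copy permuted across the mini-batch, i.e. $g_i x_i + (1 - g_i)\,\mathrm{shuffle}(x_i)$; this mixing form, the function names, and the tensor layout are assumptions made for illustration and are not taken from the paper's code. The sketch uses NumPy and therefore shows no gradient flow.

```python
import numpy as np

def batchwise_shuffle(x, rng):
    """Permute each feature field independently along the batch axis.

    x has shape (batch, fields, dim). Every field gets its own permutation,
    so a sample ends up with field values drawn from other samples.
    """
    batch, fields, _ = x.shape
    out = np.empty_like(x)
    for f in range(fields):
        perm = rng.permutation(batch)
        out[:, f, :] = x[perm, f, :]
    return out

def shuffle_gate(x, gates, rng):
    """Blend each field with its shuffled counterpart (assumed mixing form).

    gates has shape (fields,) with values in [0, 1]:
    a gate of 1 keeps the field intact, a gate of 0 replaces it by
    batch-shuffled values, which erases its link to the label.
    """
    g = np.asarray(gates, dtype=x.dtype).reshape(1, -1, 1)
    return g * x + (1.0 - g) * batchwise_shuffle(x, rng)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 4))              # toy batch: 8 samples, 3 fields, dim 4
mixed = shuffle_gate(x, np.array([1.0, 0.5, 0.0]), rng)
```

Because the shuffled copy comes from the same mini-batch, each field keeps its marginal distribution (its per-batch mean, for example) while losing its association with the label. That is one way to read "erasing information" without pushing the input outside the range the model was trained on.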

Paper Structure

This paper contains 37 sections, 2 theorems, 8 equations, 5 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4.2

For an $\epsilon$-non-predictive feature, if the regularization coefficient satisfies $\alpha > \epsilon \Delta_i$, then the total gradient is strictly positive, driving $g_i$ to 0.
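
One way to read this condition is sketched below. It rests on two assumptions that this excerpt does not state: that the training objective adds an $L_1$-style penalty $\alpha \sum_j g_j$ to the task loss, and that Definition 4.1 bounds how much the task loss can benefit from increasing the gate of an $\epsilon$-non-predictive feature by $\epsilon \Delta_i$.

$$
\mathcal{L}(g) = \mathcal{L}_{\text{task}}(g) + \alpha \sum_j g_j,
\qquad
\frac{\partial \mathcal{L}_{\text{task}}}{\partial g_i} \ge -\epsilon \Delta_i
$$

$$
\frac{\partial \mathcal{L}}{\partial g_i}
= \frac{\partial \mathcal{L}_{\text{task}}}{\partial g_i} + \alpha
\ge \alpha - \epsilon \Delta_i > 0
\quad \text{whenever } \alpha > \epsilon \Delta_i
$$

Under these assumptions every gradient-descent step decreases $g_i$, which is consistent with the statement that the gate is driven to 0.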

Figures (5)

  • Figure 1: Importance score distributions from AutoField (left) and ShuffleGate (right) for the feature selection task.
  • Figure 2: Example of Batch-wise Shuffle Operation on Feature-Field Level. Unlike global permutation, ShuffleGate permutes feature fields independently within a mini-batch.
  • Figure 3: The WYSIWYG Property. The AUC during the gate learning phase (Gate Learning AUC) exhibits a strong correlation with the AUC after pruning (Retrain AUC). This allows for reliable performance estimation without retraining.
  • Figure 4: Search Time Efficiency on Criteo. ShuffleGate achieves a 15$\times$ speedup over SHARK on feature selection (39 features). More importantly, its time cost remains constant ($O(1)$) even when scaling to 270 million embedding entries (ShuffleGate-EC), whereas SHARK would be computationally infeasible.
  • Figure 5: Visualization of Polarization. ShuffleGate learns a highly polarized distribution. Strong signals (blue) are kept high, while noise and redundant features (grey) are suppressed to near-zero, enabling a clear cut-off at the 0.5 threshold.
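
The cut-off described in Figure 5 can be sketched as a simple selection step. The gate values below are made up for illustration and the helper name `select_by_gate` is hypothetical; only the 0.5 threshold comes from the caption.

```python
import numpy as np

def select_by_gate(gates, threshold=0.5):
    """Return the indices of components whose learned gate exceeds the threshold."""
    gates = np.asarray(gates, dtype=float)
    return np.flatnonzero(gates > threshold)

# Made-up, polarized gate values: most sit near 0 or near 1.
gates = np.array([0.97, 0.02, 0.88, 0.01, 0.03])
kept = select_by_gate(gates)   # indices of the features to keep
```

When the scores are polarized in this way, the selected set changes little if the threshold moves within a wide band around 0.5, which is why no careful threshold tuning is needed.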

Theorems & Definitions (3)

  • Definition 4.1: $\epsilon$-Non-Predictive Feature
  • Theorem 4.2: Noise Suppression
  • Theorem 4.3: Signal Preservation