CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration

Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu

TL;DR

CHORD tackles the challenge of deploying personalized sequential recommender models on diverse devices by marrying device-specific channel-wise mixed-precision quantization with device-cloud collaboration. The cloud computes multi-level parameter saliency and encodes a 2-bit-per-channel strategy that a frozen backbone on-device applies in a single forward pass, enabling real-time inference with minimal communication. Across SASRec and Caser on three real-world datasets, CHORD surpasses strong baselines in accuracy while drastically reducing transmitted parameters and maintaining adaptive performance under varying resource conditions. This approach demonstrates a practical path to scalable, private, on-device personalization for sequential recommendations in heterogeneous environments.

Abstract

With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
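
To make the communication claim concrete, the following is a minimal sketch of how a 2-bit-per-channel strategy could be packed for transmission and how its size compares with sending 32-bit weights. The function names, the bit layout (four codes per byte, lowest bits first), and the mapping from codes to bit-widths are assumptions for illustration; the abstract only states that each channel's strategy takes 2 bits.

```python
# Hypothetical mapping from a 2-bit code to a bit-width. The abstract does not
# list the candidate precisions, so these four values are an assumption.
CODE_TO_BITS = {0: 2, 1: 4, 2: 8, 3: 16}


def pack_strategy(codes):
    """Pack 2-bit codes (integers 0..3), four per byte, lowest bits first."""
    out = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("each code must fit in 2 bits")
        out[i // 4] |= code << (2 * (i % 4))
    return bytes(out)


def unpack_strategy(payload, num_channels):
    """Recover the per-channel 2-bit codes from a packed payload."""
    return [(payload[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(num_channels)]


def payload_sizes(num_channels, weights_per_channel):
    """Return (strategy_bytes, weight_bytes) for one layer.

    strategy_bytes: 2 bits per channel, rounded up to whole bytes.
    weight_bytes:   32-bit (4-byte) value for every weight in every channel.
    """
    strategy_bytes = (2 * num_channels + 7) // 8
    weight_bytes = 4 * num_channels * weights_per_channel
    return strategy_bytes, weight_bytes


if __name__ == "__main__":
    codes = [0, 3, 1, 2, 2]
    packed = pack_strategy(codes)
    print(unpack_strategy(packed, len(codes)) == codes)
    print(payload_sizes(num_channels=64, weights_per_channel=64))
```

Under these assumptions, a layer with 64 channels of 64 weights each needs 16 bytes for the strategy versus 16,384 bytes for full-precision weights. The layer shape is illustrative and not taken from the paper.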

Paper Structure

This paper contains 28 sections, 13 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) Significant heterogeneity exists in interest patterns and computational resources between cloud and devices. (b) Devices use multiple recommendation models to capture diverse interaction scenarios, further emphasizing the importance of fast adaptation and inference. (c) Fine-tuned device models require costly retraining and backpropagation whenever user interests shift, forcing reliance on suboptimal models until these updates complete. (d) In our "CHORD" approach, the cloud generates personalized channel-wise quantization strategies encoded as 2-bit representations upon interest shifts. Devices then use these quantized, randomly initialized models for efficient single-pass inference with improved accuracy.
  • Figure 2: Overview of CHORD. (a) Devices generate latent interest embeddings from real-time interactions. (b) The cloud discovers filter-level and element-level relationships among the parameters of each layer based on user profiles. (c) Another module on the cloud generates layer-level parameter sensitivity for each user. (d) The cloud uses the element-level importance to reconstruct the filter-level importance, then derives a channel-wise quantization strategy from the weighted filter-level importance and the layer-level importance. Only the 2-bit channel-wise strategy is transmitted over the network, not the weights. (e) All devices share the same frozen initial weights and perform inference in a single forward pass according to their customized mixed-precision quantization strategy.
  • Figure 3: Detailed training analysis compared with four quantization baselines on ML-100K and Yelp.
  • Figure 4: Sensitivity Analysis on Channel Selection Rate
  • Figure 5: Visualization of the personalized quantization strategy: The left subplot demonstrates the distribution of layers identified as most critical. The right subplot displays the average bit allocation per channel for users in the 0th layer.
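
The device-side step in Figure 2(e) can be sketched as follows: every device holds the same frozen weights, receives a per-channel bit-width, quantizes each channel accordingly, and runs one forward pass with no backpropagation. This is a minimal sketch under stated assumptions. It uses generic symmetric uniform quantization, which may differ from the quantizer in the paper, treats each weight-matrix row as one channel, and all function names are hypothetical.

```python
def quantize_channel(weights, bits):
    """Symmetric uniform fake-quantization of one channel to `bits` bits.

    Generic scheme chosen for illustration, not necessarily the paper's.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits per channel")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1      # e.g. 127 for 8 bits, 1 for 2 bits
    scale = max_abs / levels          # step between representable values
    return [round(w / scale) * scale for w in weights]


def apply_strategy(frozen_weights, bit_widths):
    """Quantize each row (channel) of the shared frozen weights to its own bit-width."""
    if len(frozen_weights) != len(bit_widths):
        raise ValueError("need exactly one bit-width per channel")
    return [quantize_channel(row, b) for row, b in zip(frozen_weights, bit_widths)]


def forward(weights, x):
    """One linear forward pass: a dot product per channel, with no gradient step."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


if __name__ == "__main__":
    # Illustrative frozen weights with two channels; the values are made up.
    frozen = [[0.6, -1.0, 0.2], [0.1, 0.2, -0.4]]
    # Per-user strategy: channel 0 is judged less critical and gets 2 bits,
    # channel 1 is judged more critical and gets 8 bits.
    personalized = apply_strategy(frozen, bit_widths=[2, 8])
    print(forward(personalized, [1.0, 1.0, 1.0]))
```

Because the frozen weights never change, switching to a new user strategy only requires re-running `apply_strategy` with different bit-widths, which is the property that lets the adaptation avoid retraining in this sketch.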