Learning Unified User Quantized Tokenizers for User Representation

Chuan He; Yang Chen; Wuliang Huang; Tianyi Zheng; Jianhu Chen; Bin Dou; Yice Luo; Yun Zhu; Baokun Wang; Yongchao Liu; Xing Fu; Yu Cheng; Chuntao Hong; Weiqiang Wang; Xin-Wei Yao; Zhongle Xie

TL;DR

This work addresses multi-source user representation with U2QT, a two-stage framework. The first stage maps heterogeneous user data into a unified language space using Qwen3 embeddings; the second discretizes those representations into compact tokens with a multi-view RQ-VAE that combines shared and source-specific codebooks. The approach reports approximately 84× lower memory use and 3.5× faster training while maintaining semantic fidelity, which makes deployment at industrial scale practical. On downstream tasks such as future behavior prediction and recommendation, U2QT generalizes across tasks and outperforms task-specific baselines, and ablations show that each data source and the hierarchical codebook design contribute to the result. The tokenized output integrates directly with language models and targets large-scale personalization on platforms such as Alipay.

Abstract

Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: the lack of a unified representation framework, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, we use the Qwen3 Embedding model to derive a compact yet expressive feature representation; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results show that U2QT outperforms task-specific baselines on diverse downstream tasks, including future behavior prediction and recommendation, while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
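
To make the second stage more concrete, the following is a minimal sketch of residual quantization with shared and source-specific codebooks. It is not the authors' implementation: the codebooks here are random rather than learned, the ordering (shared levels first, then source-specific levels) is an assumption, and the names `tokenize_user`, `behavior`, and `profile` are hypothetical placeholders. It only illustrates how a continuous embedding can be reduced to a short tuple of integer tokens.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return the index of the codebook row closest to x (squared Euclidean)."""
    dists = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(dists))


def residual_quantize(x, codebooks):
    """Quantize x level by level; each level encodes the residual
    left over by the previous level."""
    residual = x.astype(np.float64)
    recon = np.zeros_like(residual)
    tokens = []
    for cb in codebooks:
        idx = nearest_code(residual, cb)
        tokens.append(idx)
        recon += cb[idx]                # accumulate the reconstruction
        residual = residual - cb[idx]   # pass what is left to the next level
    return tokens, recon


def tokenize_user(source_embeddings, shared_codebooks, specific_codebooks):
    """Hypothetical multi-view layout (an assumption, not from the paper):
    every source uses the shared levels first, then its own levels."""
    out = {}
    for name, emb in source_embeddings.items():
        levels = shared_codebooks + specific_codebooks[name]
        out[name] = residual_quantize(emb, levels)
    return out


# Toy setup: random codebooks stand in for learned ones.
rng = np.random.default_rng(0)
dim, codebook_size = 8, 16
shared = [rng.normal(size=(codebook_size, dim))]
specific = {
    "behavior": [rng.normal(size=(codebook_size, dim)) * 0.5],
    "profile": [rng.normal(size=(codebook_size, dim)) * 0.5],
}
embeddings = {name: rng.normal(size=dim) for name in specific}

result = tokenize_user(embeddings, shared, specific)
for name, (tokens, recon) in result.items():
    print(name, tokens)
```

Each source ends up represented by two integers instead of a `dim`-sized float vector, which is where the storage saving of a token-based representation comes from. In a trained RQ-VAE the codebooks would be learned jointly with an encoder and decoder so that the reconstruction stays close to the input.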
