Learning Unified User Quantized Tokenizers for User Representation

Chuan He, Yang Chen, Wuliang Huang, Tianyi Zheng, Jianhu Chen, Bin Dou, Yice Luo, Yun Zhu, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Xin-Wei Yao, Zhongle Xie

TL;DR

This work tackles the challenge of multi-source user representation by proposing U2QT, a two-stage framework that first maps heterogeneous user data into a unified language space using Qwen3 embeddings and then discretizes the representations with a Multi-view RQ-VAE that uses shared and source-specific codebooks to produce compact tokens. The approach achieves dramatic memory and compute savings (approximately 84× memory reduction and 3.5× faster training) while maintaining semantic fidelity, enabling scalable deployment in industrial settings. Across downstream tasks like future behavior prediction and recommendation, U2QT demonstrates strong cross-task generalization and outperforms task-specific baselines, with ablations highlighting the value of each data source and the benefits of the hierarchical codebook design. The framework’s design supports seamless integration with language models and offers practical impact for large-scale personalization in platforms such as Alipay.

Abstract

Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, we use the Qwen3 Embedding model to derive a compact yet expressive feature representation; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U2QT's advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
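The second stage can be illustrated with a minimal residual-quantization sketch. It is not the authors' implementation. It assumes that each source embedding is first quantized against a shared codebook and that the remaining residual is then quantized against a codebook specific to that source. The source names (`bill`, `search`), the codebook sizes, the embedding dimension, and the randomly initialized (untrained) codebooks are all hypothetical placeholders; in the paper the inputs would be Qwen3 embeddings and the codebooks would be learned.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code = codebook[idx]
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return tokens, recon

def make_codebook(size, dim, rng, scale):
    """Random stand-in for a learned codebook (assumption: untrained)."""
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(size)]

rng = random.Random(0)
DIM = 4  # hypothetical; real embeddings are far larger
shared = make_codebook(8, DIM, rng, 1.0)
specific = {  # hypothetical source names
    "bill": make_codebook(8, DIM, rng, 0.5),
    "search": make_codebook(8, DIM, rng, 0.5),
}

def tokenize_user(source_embeddings):
    """Map each source embedding to (tokens, reconstruction).

    Assumed ordering: shared codebook first, then the source-specific one.
    """
    out = {}
    for source, emb in source_embeddings.items():
        out[source] = residual_quantize(emb, [shared, specific[source]])
    return out

user = {"bill": [0.2, -0.4, 0.9, 0.1], "search": [-0.7, 0.3, 0.0, 0.5]}
result = tokenize_user(user)
for source, (tokens, recon) in result.items():
    print(source, tokens)
```

Each source ends up stored as a short list of integer indices rather than a dense float vector, which is where the storage savings described above would come from under these assumptions.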

Paper Structure

This paper contains 31 sections, 14 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The difference between pre-merge and post-merge frameworks. Our proposed method encodes pre-merged multi-source features through a unified codebook. In contrast, previous works adopt a post-merge framework that simply concatenates the multi-source features without cross-feature interactions.
  • Figure 2: The advantage of our proposed method, which improves data utilization during both the training and inference phases and alleviates prohibitive storage demands.
  • Figure 3: Overview of the Unified User Tokenizer. First, we use the Qwen3 Embedding model to encode long-context features into compact yet expressive embeddings. Then, we propose an MRQ-VAE with a shared-specific codebook hierarchy to further compress the data into discrete embeddings. Finally, we reconstruct the Qwen3 embeddings of the multi-source data through source-specific MLP decoders.
  • Figure 4: Aligning the user tokenizer with future behavior descriptions. The user tokenizer can also be enhanced by aligning it with future behavior, which gives it the ability to predict future behavior.
  • Figure 5: A 3D bar chart showing the utilization of each source-specific codebook at different layers.
  • ...and 2 more figures