Instruction-aware User Embedding via Synergistic Language and Representation Modeling

Ziyi Gao, Yike Xu, Jiahao Yuan, Baokun Wang, Jinyong Wen, Xiaotong Lin, Yun Liu, Xing Fu, Yu Cheng, Yongchao Liu, Weiqiang Wang, Zhongle Xie

TL;DR

This work tackles the rigidity and domain limitations of traditional user representations by introducing InstructUE, an instruction-aware embedding framework built on large language models. It combines six modality-specific encoders with lightweight adapters and employs a contrastive-autoregressive training regime to bridge language and representation spaces using a synthetic UserQA dataset. A key innovation is instruction-guided dynamic embedding generation, which adapts embeddings to downstream tasks via natural language directives and few-shot instruction tuning. Experiments across six real-world tasks demonstrate strong generalization and robustness, showing that instruction quality substantially affects downstream performance and that the proposed training framework yields superior, adaptable user representations for personalized marketing, prediction, and recommendation tasks.

Abstract

User representation modeling has become increasingly crucial for personalized applications, yet existing approaches struggle with generalizability across domains and sensitivity to noisy behavioral signals. We present InstructUE, an instruction-aware user embedding foundation model that leverages large language models (LLMs) to generate general and instruction-aware user representations. InstructUE introduces a multi-encoder architecture with a lightweight adapter that efficiently processes heterogeneous data from six different sources while preserving their structural characteristics. Additionally, it proposes a novel contrastive-autoregressive training framework that bridges language and representation spaces through a curated UserQA dataset. The contrastive-autoregressive training framework simultaneously leverages autoregressive learning to capture domain knowledge in language space and contrastive learning to align user-text embeddings in representation space, thereby enhancing the instruction-awareness and noise-robustness of user embeddings. Through extensive experiments on real-world applications, we demonstrate that InstructUE significantly outperforms existing methods across multiple domains including user prediction, marketing, and recommendation scenarios. Our results show that instruction-aware user modeling can effectively achieve instruction-guided denoising of user information in specific scenarios, paving the way for more generalizable and robust user representation learning.
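The joint objective described in the abstract can be illustrated with a minimal sketch. The exact loss is not given in this section, so the code below assumes a common formulation: next-token cross-entropy for the autoregressive part, InfoNCE with in-batch negatives for the user-text alignment, and a weighted sum of the two. The function names, the `temperature` value, and the `weight` parameter are illustrative assumptions, not the paper's definitions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def autoregressive_loss(token_logits, target_ids):
    """Mean next-token cross-entropy; token_logits[t] scores the vocabulary at step t."""
    nll = [-log_softmax(logits)[tgt] for logits, tgt in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(user_embs, text_embs, temperature=0.05):
    """InfoNCE with in-batch negatives: user i should match text i (assumed pairing)."""
    losses = []
    for i, u in enumerate(user_embs):
        sims = [cosine(u, t) / temperature for t in text_embs]
        losses.append(-log_softmax(sims)[i])
    return sum(losses) / len(losses)

def joint_loss(token_logits, target_ids, user_embs, text_embs, weight=1.0):
    """Weighted sum of the two terms; the weighting scheme is an assumption."""
    ar = autoregressive_loss(token_logits, target_ids)
    con = contrastive_loss(user_embs, text_embs)
    return ar + weight * con
```

In this sketch the autoregressive term operates on token logits (language space) while the contrastive term operates on embeddings (representation space), which mirrors the division of labor the abstract describes.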

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison between (A) General User Embedding [dou2025transferable] and (B) our InstructUE. (A) learns transferable user representations across domains but generates fixed embeddings regardless of downstream context. (B) extends (A) with instruction-aware modulation, enabling a single model to produce adaptive, domain-specific embeddings via natural language instructions.
  • Figure 2: Overview of the framework. Multi-source user data is formatted into natural language sequences with modality delimiters. Instructions guide the generation process, and the <USER> token extracts the final embedding.
  • Figure 3: Input format of UserQA. Each modality is wrapped in semantic delimiters (shown in blue). An optional instruction is followed by the special <USER> token, which signals the model to extract a unified user embedding.
  • Figure 4: Training and inference pipeline of InstructUE. User representations are learned via a contrastive-autoregressive joint strategy, while instruction quality is enhanced through cluster-based supervised tuning of learnable instructions.
  • Figure 5: Visualization of universal and instruction-aware representations across six scenarios.
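
The input format described in the captions for Figures 2 and 3 can be sketched as follows. Only the `<USER>` token comes from the source; the delimiter syntax (`<name>...</name>`), the modality names, and the `build_userqa_input` helper are hypothetical choices made for illustration, since the actual delimiters are not listed here.

```python
USER_TOKEN = "<USER>"  # special token named in the figure captions

def build_userqa_input(modalities, instruction=None):
    """Serialize multi-source user data into one text sequence.

    modalities: dict mapping a modality name to its text content
                (names are hypothetical placeholders).
    instruction: optional natural language instruction.
    """
    parts = []
    for name, content in modalities.items():
        # Wrap each modality in its own delimiter pair (assumed syntax).
        parts.append(f"<{name}>{content}</{name}>")
    if instruction:
        # The optional instruction comes after the user data.
        parts.append(instruction)
    # The <USER> token closes the sequence and marks where the embedding is read.
    parts.append(USER_TOKEN)
    return "\n".join(parts)

example = build_userqa_input(
    {"bill": "coffee 3.50", "search": "running shoes"},
    instruction="Predict purchase intent.",
)
```

Omitting `instruction` would correspond to the general (universal) embedding case, while supplying one would correspond to the instruction-aware case.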