UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, Wentao Zhang

TL;DR

UniCTokens tackles the mismatch between personalized understanding and generation by introducing unified concept tokens that support both tasks within a single vision-language framework. A three-stage progressive training strategy (understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation) enables cross-task transfer and mutual enhancement. The method is evaluated on UnifyBench, where it shows competitive performance in concept understanding and concept generation and state-of-the-art performance in personalized attribute-reasoning generation. These results indicate that better understanding improves generation and that generation can in turn refine understanding. This work advances unified personalization with efficient data usage and provides a foundation for future cross-task VLM personalization strategies.

Abstract

Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This can limit their ability to generate images from complex prompts. For example, given the concept $\langle bo\rangle$, such methods struggle to generate "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of its hat. We call this kind of generation personalized attribute-reasoning generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting both personalized tasks. Moreover, we propose a progressive training strategy with three stages (understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation) to enhance the mutual benefits between the two tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and attribute-reasoning generation. Experimental results on UnifyBench indicate that UniCTokens is competitive with leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized attribute-reasoning generation. Our research demonstrates that enhanced understanding improves generation, and that the generation process can yield valuable insights into understanding. Our code and dataset will be released at: https://github.com/arctanxarc/UniCTokens.
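The core idea in the abstract, one shared set of concept tokens updated across three sequential stages, can be pictured with a toy sketch. This is not the authors' implementation: the class name, the token count, the embedding dimension, and the constant placeholder "gradient" are all hypothetical, and only the three stage names come from the abstract. The sketch shows nothing more than that every stage writes to the same parameters rather than to per-task copies.

```python
import random
from dataclasses import dataclass, field

# Stage names follow the abstract; everything else below is hypothetical.
STAGES = (
    "understanding_warm_up",
    "bootstrapping_generation_from_understanding",
    "deepening_understanding_from_generation",
)


@dataclass
class UnifiedConceptTokens:
    """One shared embedding table for a user concept (toy stand-in)."""
    name: str
    num_tokens: int = 4   # hypothetical token count
    dim: int = 8          # hypothetical embedding size
    embeddings: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def __post_init__(self):
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
        self.embeddings = [
            [rng.gauss(0.0, 0.02) for _ in range(self.dim)]
            for _ in range(self.num_tokens)
        ]

    def update(self, stage, grads, lr=0.1):
        """Apply one in-place gradient step and record which stage did it."""
        for row, grad_row in zip(self.embeddings, grads):
            for j, g in enumerate(grad_row):
                row[j] -= lr * g
        self.history.append(stage)


def placeholder_grads(tokens, stage):
    """Stand-in for a real task loss gradient (assumption: constant ones)."""
    return [[1.0] * tokens.dim for _ in range(tokens.num_tokens)]


def train_progressively(tokens):
    """Run the three stages in order on the SAME token table."""
    for stage in STAGES:
        tokens.update(stage, placeholder_grads(tokens, stage))
    return tokens


if __name__ == "__main__":
    bo = train_progressively(UnifiedConceptTokens(name="<bo>"))
    print(bo.history)
```

In a real system each stage would compute its gradient from an understanding or generation objective; the design point illustrated here is only that the token table is shared, so whatever one stage learns is visible to the next.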

Paper Structure

This paper contains 41 sections, 7 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The capability overview of UniCTokens. UniCTokens achieves personalized understanding and generation of a unified VLM using user-provided concept images and texts. This is accomplished by fine-tuning a set of unified concept tokens, which harness the mutual benefits of understanding and generation. Notably, UniCTokens supports complex personalized attribute-reasoning generation, which has never been achieved by previous methods.
  • Figure 2: The overview of UniCTokens. Rather than training separate concept tokens for understanding and generation, we train unified concept tokens that take advantage of the mutual benefits of both tasks. Linked with shared tokens, we achieve cross-task transfer.
  • Figure 3: Overview of Training Stages of UniCTokens.
  • Figure 4: Generation as Perception. The first row depicts the generation process, while the differences at different timestamps capture concept details (e.g., pig noses and cup handles).
  • Figure 5: Qualitative Comparisons among UniCTokens, Yo'Chameleon and GPT-4o. Our proposed UniCTokens demonstrates its controllable and personalized generation.
  • ...and 6 more figures