StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

Yi Wu, Lingting Zhu, Shengju Qian, Lei Liu, Wandi Qiao, Lequan Yu, Bin Li

TL;DR

StyleAR tackles the data bottleneck in style-aligned text-to-image generation for multimodal autoregressive models by learning from binary text-image data, avoiding triplet data collection. It introduces a data-curation pipeline that discards the reference style image, mixes stylized and raw images, and leverages CLIP-based style tokens through a perceiver resampler, along with style-enhanced tokens and SAM-based inference; DPO is used for post-training. The method achieves strong prompt adherence and superior style consistency compared to diffusion baselines, as shown by quantitative metrics and user studies, while preserving AR advantages and enabling conditional controls like depth maps. This work broadens the applicability of AR models in stylized image generation and offers scalable data-efficient strategies for future multimodal systems.

Abstract

Multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. By analogy with instruction-following tuning for image editing in AR models, style-aligned generation requires a reference style image and a prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. Acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining the conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an approach that combines a specially designed data curation method with our proposed AR models to effectively utilize text-to-image binary data for style-aligned text-to-image generation. Our method synthesizes target stylized data using a reference style image and prompt, but incorporates only the target stylized image as the image modality to create high-quality binary data. To facilitate binary data training, we introduce a CLIP image encoder with a perceiver resampler that translates the image input into style tokens aligned with the multimodal tokens in AR models, and we implement a style-enhanced token technique to prevent content leakage, a common issue in previous work. Furthermore, we mix raw images drawn from large-scale text-image datasets with stylized images to enhance StyleAR's ability to extract richer stylistic features and ensure style consistency. Extensive qualitative and quantitative experiments demonstrate the superior performance of our method.
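The abstract's step of turning image features into a fixed set of style tokens can be illustrated with a small sketch. This is not the authors' implementation: it shows only the general idea of a perceiver-style resampler, where a fixed number of learnable query vectors cross-attend over a variable number of image features. The sizes, the random weights, the single attention head, and the names `resample`, `latents`, `w_q`, `w_k`, and `w_v` are all assumptions made for illustration, and the random `image_features` merely stand in for the output of a frozen CLIP image encoder.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, v):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix, scaled by 1/sqrt(cols)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def resample(image_features, latents, w_q, w_k, w_v):
    """One single-head cross-attention step.

    Each latent query attends over all image features, so the number of
    output tokens equals the number of latents, not the number of features.
    """
    keys = [matvec(w_k, f) for f in image_features]
    values = [matvec(w_v, f) for f in image_features]
    dim = len(keys[0])
    style_tokens = []
    for latent in latents:
        q = matvec(w_q, latent)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        token = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        style_tokens.append(token)
    return style_tokens


# All sizes below are hypothetical and chosen only to keep the example small.
FEATURE_DIM, TOKEN_DIM, NUM_PATCHES, NUM_STYLE_TOKENS = 8, 4, 10, 3

rng = random.Random(0)
image_features = random_matrix(NUM_PATCHES, FEATURE_DIM, rng)  # stand-in for CLIP features
latents = random_matrix(NUM_STYLE_TOKENS, TOKEN_DIM, rng)      # would be trainable queries
w_q = random_matrix(TOKEN_DIM, TOKEN_DIM, rng)
w_k = random_matrix(TOKEN_DIM, FEATURE_DIM, rng)
w_v = random_matrix(TOKEN_DIM, FEATURE_DIM, rng)

style_tokens = resample(image_features, latents, w_q, w_k, w_v)
print(len(style_tokens), len(style_tokens[0]))
```

The point of the design is that the output length is set by the number of queries, so the AR model always receives the same number of style tokens regardless of how many patch features the image encoder produces. A real resampler would typically stack several such layers with multiple heads, normalization, and feed-forward blocks.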

Paper Structure

This paper contains 14 sections, 2 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Stylized samples of our StyleAR. Our StyleAR is capable of generating images that are highly consistent in style with the reference images across a diverse range of styles, and highly aligned in semantics with the input prompts of various categories.
  • Figure 2: The pipeline of our method. a) We first design a stylized image data curation process that forms binary data with high prompt adherence while avoiding low style consistency. b) We use a mixed dataset to enhance the learning of rich stylistic features. c) With the designed data curation and model framework, our method achieves high prompt adherence and style consistency.
  • Figure 3: The framework of our StyleAR. During training, we utilize a frozen CLIP [radford2021learning] image encoder along with a trainable perceiver resampler [jaegle2021perceiver, alayrac2022flamingo] module to efficiently extract features. Subsequently, style tokens are combined with the injected Gaussian noise and concatenated with multimodal tokens by replacing the placeholder tokens. During inference, we incorporate SAM [kirillov2023segment] to remove irrelevant semantic content from the reference style image.
  • Figure 4: Qualitative comparison. We conducted a comprehensive qualitative evaluation by comparing our StyleAR with existing methods, all of which are diffusion-based, including InstantStyle [wang2024instantstyle], IP-Adapter [ye2023ip], StyleAligned [hertz2024style], StyleCrafter [liu2023stylecrafter], and StyleShot [gao2024styleshot].
  • Figure 5: User study. We conducted a user study by comparing StyleAR with existing methods, including InstantStyle [wang2024instantstyle], IP-Adapter [ye2023ip], StyleAligned [hertz2024style], StyleCrafter [liu2023stylecrafter], and StyleShot [gao2024styleshot].
  • ...and 3 more figures
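
The Figure 3 caption describes combining style tokens with injected Gaussian noise and then substituting them for placeholder tokens in the multimodal sequence. The sketch below is one possible reading of that step, not the paper's code. It assumes the noise is added element-wise, and the noise scale `sigma`, the `"<style>"` placeholder marker, and the helper names `add_gaussian_noise` and `splice_style_tokens` are hypothetical.

```python
import random

PLACEHOLDER = "<style>"  # hypothetical marker for a slot reserved for a style token


def add_gaussian_noise(tokens, sigma, rng):
    """Add zero-mean Gaussian noise to every element (additive noise is an assumption)."""
    return [[x + rng.gauss(0.0, sigma) for x in tok] for tok in tokens]


def splice_style_tokens(sequence, style_tokens):
    """Replace each placeholder, in order, with the next style token.

    Non-placeholder entries (e.g. text embeddings) pass through unchanged.
    """
    slots = sum(1 for item in sequence if item == PLACEHOLDER)
    if slots != len(style_tokens):
        raise ValueError(f"{slots} placeholders but {len(style_tokens)} style tokens")
    remaining = iter(style_tokens)
    return [next(remaining) if item == PLACEHOLDER else item for item in sequence]


rng = random.Random(0)
style_tokens = [[0.1, 0.2], [0.3, 0.4]]            # toy 2-dimensional style tokens
noisy = add_gaussian_noise(style_tokens, sigma=0.05, rng=rng)

# Toy sequence: two ordinary embeddings surrounding two reserved style slots.
sequence = [[1.0, 1.0], PLACEHOLDER, PLACEHOLDER, [2.0, 2.0]]
spliced = splice_style_tokens(sequence, noisy)
print(len(spliced))
```

Keeping the placeholder count equal to the number of style tokens leaves the sequence length unchanged, which is why the splice checks the counts before replacing anything.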