StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation
Yi Wu, Lingting Zhu, Shengju Qian, Lei Liu, Wandi Qiao, Lequan Yu, Bin Li
TL;DR
StyleAR tackles the data bottleneck in style-aligned text-to-image generation for multimodal autoregressive models by learning from binary text-image data, avoiding triplet data collection. It introduces a data-curation pipeline that discards the reference style image and mixes stylized and raw images. Style tokens are obtained from a CLIP image encoder through a perceiver resampler and are complemented by style-enhanced tokens and SAM-based inference, with DPO used for post-training. The method achieves strong prompt adherence and superior style consistency compared to diffusion baselines, as shown by quantitative metrics and user studies, while preserving AR advantages and enabling conditional controls such as depth maps. This work broadens the applicability of AR models in stylized image generation and offers scalable, data-efficient strategies for future multimodal systems.
Abstract
Multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. Analogous to instruction-following tuning of AR models for image editing, style-aligned generation requires a reference style image and a prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. Yet acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining the conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an approach that combines a specially designed data curation method with our proposed AR models to effectively utilize text-to-image binary data for style-aligned text-to-image generation. Our method synthesizes target stylized data using a reference style image and prompt, but incorporates only the target stylized image as the image modality to create high-quality binary data. To facilitate binary data training, we introduce a CLIP image encoder with a perceiver resampler that translates the image input into style tokens aligned with the multimodal tokens in AR models, and we implement a style-enhanced token technique to prevent content leakage, which is a common issue in previous work. Furthermore, we mix raw images drawn from large-scale text-image datasets with stylized images to enhance StyleAR's ability to extract richer stylistic features and ensure style consistency. Extensive qualitative and quantitative experiments demonstrate the superior performance of our method.
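
The data curation step described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It shows how a synthesized triplet might be reduced to a binary text-image pair by dropping the reference style image, and how stylized pairs might then be mixed with raw pairs. The names `Triplet`, `Pair`, `to_binary`, and `mix_pairs`, the use of file paths as image handles, and the `raw_fraction` parameter are all hypothetical; the abstract does not state a mixing ratio.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triplet:
    """A synthesized sample: (reference style image, prompt) -> stylized image."""
    reference_image: str  # hypothetical: path to the reference style image
    prompt: str
    stylized_image: str   # hypothetical: path to the synthesized target image


@dataclass(frozen=True)
class Pair:
    """A binary text-image sample; it carries no reference image."""
    prompt: str
    image: str
    source: str  # "stylized" or "raw"


def to_binary(triplets: List[Triplet]) -> List[Pair]:
    """Keep only the prompt and the stylized target; drop the reference image."""
    return [Pair(t.prompt, t.stylized_image, "stylized") for t in triplets]


def mix_pairs(stylized: List[Pair], raw: List[Pair],
              raw_fraction: float, seed: int = 0) -> List[Pair]:
    """Mix stylized and raw pairs so that roughly `raw_fraction` of the result is raw.

    `raw_fraction` is an assumed knob for illustration only.
    """
    if not 0.0 <= raw_fraction < 1.0:
        raise ValueError("raw_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of raw pairs needed to reach the requested fraction,
    # capped by how many raw pairs are available.
    n_raw = round(len(stylized) * raw_fraction / (1.0 - raw_fraction))
    n_raw = min(n_raw, len(raw))
    mixed = list(stylized) + rng.sample(raw, n_raw)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    triplets = [Triplet(f"ref_{i}.png", f"prompt {i}", f"styl_{i}.png") for i in range(3)]
    raw = [Pair(f"caption {i}", f"raw_{i}.png", "raw") for i in range(10)]
    for pair in mix_pairs(to_binary(triplets), raw, raw_fraction=0.5):
        print(pair)
```

In this sketch the reference image never reaches the training set, which mirrors the abstract's statement that only the target stylized image is kept as the image modality.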
