Learning Visual Generative Priors without Text

Shuailei Ma; Kecheng Zheng; Ying Wei; Wei Wu; Fan Lu; Yifei Zhang; Chen-Wei Xie; Biao Gong; Jiapeng Zhu; Yujun Shen

TL;DR

This work introduces Lumos, a pure-vision image-to-image (I2I) framework that learns visual generative priors from in-the-wild images without text supervision. By extracting rich visual semantics with pre-trained vision encoders and training a latent diffusion model conditioned on these features, Lumos establishes a scalable I2I prior that can be transferred to downstream tasks with limited text data. The authors demonstrate that I2I priors can match or surpass traditional text-to-image (T2I) priors on text-conditioned generation and outperform T2I priors on text-irrelevant tasks such as novel view synthesis and image-to-video generation. This approach reduces dependence on costly text–image pairs, enabling scalable texture modeling and broad applicability across visual generation tasks in content creation and multimodal understanding.
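The data flow described above (a frozen vision encoder supplies the condition, and a latent diffusion model is trained on it) can be illustrated with a toy training step. This is a minimal sketch under stated assumptions, not the authors' implementation: `encode_image`, `to_latent`, `denoiser`, and the noise-prediction loss are all hypothetical stand-ins, since the summary names neither the encoder, the latent autoencoder, nor the diffusion objective.

```python
import math
import random


def encode_image(image):
    """Hypothetical stand-in for a frozen pre-trained vision encoder.

    Returns a 2-element feature vector (mean and standard deviation).
    """
    mean = sum(image) / len(image)
    var = sum((p - mean) ** 2 for p in image) / len(image)
    return [mean, math.sqrt(var)]


def to_latent(image):
    """Hypothetical stand-in for a latent autoencoder: average pixel pairs."""
    return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image) - 1, 2)]


def add_noise(latent, noise, alpha_bar):
    """Forward diffusion: blend the clean latent with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(latent, noise)]


def denoiser(noisy_latent, alpha_bar, cond, weights):
    """Hypothetical linear denoiser conditioned on the image features."""
    w_z, w_t, w_c0, w_c1 = weights
    bias = w_t * alpha_bar + w_c0 * cond[0] + w_c1 * cond[1]
    return [w_z * z + bias for z in noisy_latent]


def training_loss(image, weights, rng):
    """One self-supervised step: the same image gives condition and target."""
    cond = encode_image(image)            # condition from the image itself, no text
    latent = to_latent(image)
    alpha_bar = rng.uniform(0.05, 0.95)   # assumed noise level for this step
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = add_noise(latent, noise, alpha_bar)
    pred = denoiser(noisy, alpha_bar, cond, weights)
    # Assumed objective: mean squared error between predicted and true noise.
    return sum((p - e) ** 2 for p, e in zip(pred, noise)) / len(noise)


rng = random.Random(0)
image = [rng.random() for _ in range(16)]  # toy 16-pixel "image"
loss = training_loss(image, (0.5, 0.0, 0.0, 0.0), rng)
```

The point of the sketch is that no caption appears anywhere in the loop: the image provides both the conditioning features and the reconstruction target, which is what lets training use unlabeled in-the-wild images.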

Abstract

Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 of the text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, such as image-to-3D and image-to-video. Our project page is available at https://ant-research.github.io/lumos.
