Learning Visual Generative Priors without Text

Shuailei Ma, Kecheng Zheng, Ying Wei, Wei Wu, Fan Lu, Yifei Zhang, Chen-Wei Xie, Biao Gong, Jiapeng Zhu, Yujun Shen

TL;DR

This work introduces Lumos, a pure-vision image-to-image (I2I) framework that learns visual generative priors from in-the-wild images without text supervision. Lumos extracts visual semantics with pre-trained vision encoders and trains a latent diffusion model conditioned on these features, yielding a scalable I2I prior that can be transferred to downstream tasks with limited text data. The authors show that I2I priors match or surpass text-to-image (T2I) priors on text-conditioned generation while using only 1/10 of the text-image pairs for fine-tuning, and that they outperform T2I priors on text-irrelevant tasks such as novel view synthesis and image-to-video generation. The approach reduces dependence on costly text-image pairs and keeps the prior focused on texture modeling, which applies across diverse visual generation tasks.

Abstract

Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 of the text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video. Our project page is available at https://ant-research.github.io/lumos.
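
To make the training setup described above more concrete, here is a minimal, illustrative sketch of one self-supervised I2I training step: an image is encoded into a feature vector, its latent is noised, and a denoiser conditioned on the feature is scored on how well it recovers the noise. This is not the authors' implementation. The function names (`encode_image`, `add_noise`, `predict_noise`, `i2i_loss`), the random-projection "encoder", the linear denoiser, the cosine noise schedule, the noise-prediction objective, and the choice to take the condition and the target latent from the same image are all assumptions made for illustration.

```python
import math
import random


def encode_image(pixels, proj):
    """Hypothetical stand-in for a frozen pre-trained vision encoder.

    A fixed random projection followed by L2 normalization; a real setup
    would use a pre-trained model instead.
    """
    feats = [sum(p * w for p, w in zip(pixels, row)) for row in proj]
    norm = math.sqrt(sum(f * f for f in feats)) or 1.0
    return [f / norm for f in feats]


def add_noise(latent, noise, t):
    """Forward diffusion with an assumed cosine schedule, t in [0, 1]."""
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    return [alpha * z + sigma * e for z, e in zip(latent, noise)]


def predict_noise(noisy, cond, t, weights):
    """Toy linear denoiser that sees the noisy latent, the image features, and t."""
    inputs = noisy + cond + [t]
    return [sum(x * w for x, w in zip(inputs, row)) for row in weights]


def i2i_loss(pixels, latent, proj, weights, rng):
    """One training-step loss: no text is involved at any point.

    `latent` is assumed to be a precomputed latent of the same image
    (for example from an autoencoder, which is not modeled here).
    """
    cond = encode_image(pixels, proj)          # condition from the image itself
    t = rng.random()                           # random diffusion time
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = add_noise(latent, noise, t)
    pred = predict_noise(noisy, cond, t, weights)
    # Mean squared error between predicted and true noise.
    return sum((p - e) ** 2 for p, e in zip(pred, noise)) / len(noise)


# Toy usage with made-up sizes: 12 "pixels", 3 feature dims, 4 latent dims.
rng = random.Random(0)
pixels = [rng.random() for _ in range(12)]
latent = [rng.gauss(0.0, 1.0) for _ in range(4)]
proj = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(3)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(4 + 3 + 1)] for _ in range(4)]
loss = i2i_loss(pixels, latent, proj, weights, rng)
print(loss)
```

In a real pipeline the loss would be minimized over many images with a gradient-based optimizer. The sketch only shows where the supervision comes from: the image itself, through the encoder features and the latent, rather than a paired caption.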

Paper Structure

This paper contains 35 sections, 4 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Diverse downstream tasks of Lumos including (a) text-to-image generation, (b) novel view synthesis (left: input view, middle: random novel views, right: reconstruction Gaussian) and (c) image-to-video generation.
  • Figure 2: Various image generation tasks can be improved by our image-to-image priors. The I2I prior reduces the downstream T2I model's dependence on high-quality data, and the performance improvement grows as the I2I data scales up. We also adopt a pure vision-based I2I prior (i.e., I2I prior with DINO), which is a late bloomer for T2I generation. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant vision tasks, like I2V and NVS.
  • Figure 3: Overall architecture of our framework. (a) Image-to-Image Generation, (b) Text-to-Image Generation, (c) Novel View Synthesis and (d) Image-to-Video Generation.
  • Figure 4: Comparison of different priors for novel view synthesis. The I2I prior shows consistently better metrics from the start of fine-tuning.
  • Figure 5: Samples produced by Lumos-T2I exhibit exceptional quality, characterized by a remarkable level of fidelity and precision in adhering to the provided textual prompts.
  • ...and 14 more figures