Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision

Yuandong Pu, Le Zhuo, Kaiwen Zhu, Liangbin Xie, Wenlong Zhang, Xiangyu Chen, Peng Gao, Yu Qiao, Chao Dong, Yihao Liu

TL;DR

OmniLV is a unified multimodal framework for low-level vision that handles over 100 tasks across restoration, enhancement, dense prediction, and stylization by conditioning a diffusion-prior backbone on separately encoded text and visual prompts. The model emphasizes fidelity and supports arbitrary resolutions, aided by a 40M-instance OmniLV dataset and three-stage training that includes in-context learning. Key findings show that separate multimodal encoding and early conditioning improve multi-task generalization, while injecting high-level generative tasks can harm detail-sensitive restoration. Empirical results demonstrate strong performance across tasks and real-world robustness, with insights into prompt design and task interference, highlighting OmniLV’s potential as a generalist tool for low-level vision.

Abstract

We present Lumina-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible and user-friendly interactions. Built on Diffusion Transformer (DiT)-based generative priors, our framework supports arbitrary resolutions -- achieving optimal performance at 1K resolution -- while preserving fine-grained details and high fidelity. Through extensive experiments, we demonstrate that separately encoding text and visual instructions, combined with co-training using shallow feature control, is essential to mitigate task ambiguity and enhance multi-task generalization. Our findings also reveal that integrating high-level generative tasks into low-level vision models can compromise detail-sensitive restoration. These insights pave the way for more robust and generalizable low-level vision systems.

Paper Structure

This paper contains 23 sections, 4 equations, 29 figures, 7 tables.

Figures (29)

  • Figure 1: Illustration of OmniLV's versatile capabilities. As a universal framework, OmniLV is capable of handling a wide variety of low-level vision tasks within a single model, which adapts to diverse input-output domains and generates high-fidelity results.
  • Figure 2: Comparison between the MLLM-guided and LLM-guided frameworks.
  • Figure 3: t-SNE visualization of the feature spaces of the LLM and the MLLM. Each dot represents a task instruction.
  • Figure 4: Illustration of five different variants for injecting the condition.
  • Figure 5: Overall framework of OmniLV. First, input images are encoded into latent space by a VAE encoder. Then, the image latent and the noise latent are patchified into visual tokens. Optionally, in-context pairs can be added to the visual tokens to handle complex scenarios. At the same time, the instruction prompt and the description prompt are processed by Gemma2B. Finally, the denoised results are decoded to obtain the desired output images.
  • ...and 24 more figures
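
The token-building step in the Figure 5 caption (patchify the image latent and the noise latent, optionally adding in-context pairs) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `patchify`, `unpatchify`, and `build_visual_sequence`, the patch size of 2, the latent shape, and the token ordering (context pairs, then the input, then the noise) are all assumptions made for the example. The VAE, the Gemma2B text branch, and the DiT itself are left out.

```python
import numpy as np


def patchify(latent: np.ndarray, patch: int = 2) -> np.ndarray:
    """Split a (C, H, W) latent into (H/patch * W/patch) tokens of size C*patch*patch."""
    c, h, w = latent.shape
    if h % patch or w % patch:
        raise ValueError("latent height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = latent.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # -> (gh, gw, c, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)


def unpatchify(tokens: np.ndarray, c: int, h: int, w: int, patch: int = 2) -> np.ndarray:
    """Inverse of patchify: rebuild the (C, H, W) latent from its tokens."""
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, c, patch, patch)
    x = x.transpose(2, 0, 3, 1, 4)  # -> (c, gh, patch, gw, patch)
    return x.reshape(c, h, w)


def build_visual_sequence(noise_latent, input_latent, context_pairs=(), patch=2):
    """Concatenate visual tokens along the sequence axis.

    Ordering is an assumption for illustration: each optional in-context
    (source, target) pair first, then the input image, then the noise latent.
    """
    parts = []
    for source, target in context_pairs:
        parts.append(patchify(source, patch))
        parts.append(patchify(target, patch))
    parts.append(patchify(input_latent, patch))
    parts.append(patchify(noise_latent, patch))
    return np.concatenate(parts, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 8, 8)  # hypothetical latent: 4 channels, 8x8 spatial
    image, noise = rng.normal(size=shape), rng.normal(size=shape)
    pair = (rng.normal(size=shape), rng.normal(size=shape))
    seq = build_visual_sequence(noise, image, context_pairs=[pair])
    print(seq.shape)
```

With these assumed shapes, each 4-channel 8x8 latent contributes 16 tokens of width 16, so adding one in-context pair grows the visual sequence from 32 to 64 tokens. Because the sequence length simply scales with the latent's spatial size, this kind of tokenization is one way a transformer backbone can accept inputs of varying resolution.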