Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision
Yuandong Pu, Le Zhuo, Kaiwen Zhu, Liangbin Xie, Wenlong Zhang, Xiangyu Chen, Peng Gao, Yu Qiao, Chao Dong, Yihao Liu
TL;DR
OmniLV is a unified multimodal framework for low-level vision that handles over 100 tasks across restoration, enhancement, dense prediction, and stylization by conditioning a diffusion-prior backbone on separately encoded text and visual prompts. The model emphasizes fidelity and supports arbitrary resolutions, aided by a large 40M-instance OmniLV dataset and three-stage training that includes in-context learning. Key findings show that separate multimodal encoding and early conditioning improve multi-task generalization, while injecting high-level generative tasks can harm detail-sensitive restoration. Empirical results demonstrate strong performance across tasks and robustness on real-world inputs, with insights into prompt design and task interference, highlighting OmniLV’s potential as a generalist tool for low-level vision.
Abstract
We present Lumina-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible and user-friendly interactions. Built on Diffusion Transformer (DiT)-based generative priors, our framework supports arbitrary resolutions -- achieving optimal performance at 1K resolution -- while preserving fine-grained details and high fidelity. Through extensive experiments, we demonstrate that separately encoding text and visual instructions, combined with co-training using shallow feature control, is essential to mitigate task ambiguity and enhance multi-task generalization. Our findings also reveal that integrating high-level generative tasks into low-level vision models can compromise detail-sensitive restoration. These insights pave the way for more robust and generalizable low-level vision systems.
