Image Translation as Diffusion Visual Programmers

Cheng Han; James C. Liang; Qifan Wang; Majid Rabbani; Sohail Dianat; Raghuveer Rao; Ying Nian Wu; Dongfang Liu

TL;DR

This work reframes image translation as a two-stage process combining a condition-flexible diffusion model with a GPT-driven planner to produce a sequence of visual programs for targeted RoI editing and translation. By decoupling high-dimensional concepts into low-dimensional symbols through in-context visual programming, DVP achieves context-free, local edits with improved explainability and controllability. The key innovations include instance normalization guidance to remove reliance on hand-tuned guidance scales, a neuro-symbolic planning framework with explicit intermediate symbols, and a modular pipeline that integrates off-the-shelf vision models with diffusion. Empirical results on a new 100-pair benchmark demonstrate superior fidelity and qualitative performance against multiple baselines, along with ablation studies validating the contributions and identifying limitations related to occlusions and challenging lighting conditions.
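The idea of a planner emitting a sequence of visual programs over explicit symbols can be illustrated with a toy interpreter. This is only a sketch under stated assumptions: the operation names (`segment`, `translate`, `compose`), the `Step` type, and the string-valued stand-ins for images are all hypothetical and are not taken from DVP's implementation. It shows how each step reads named symbols and writes a new one, so every intermediate result stays inspectable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """One line of a visual program (hypothetical structure)."""
    op: str                   # name of the operation to call
    inputs: Tuple[str, ...]   # symbols the step reads
    output: str               # symbol the step writes


def run_program(
    steps: List[Step],
    ops: Dict[str, Callable[..., str]],
    symbols: Dict[str, str],
) -> Tuple[Dict[str, str], List[Tuple[str, str, str]]]:
    """Execute steps in order, recording every intermediate symbol."""
    state = dict(symbols)
    trace = []
    for step in steps:
        args = [state[name] for name in step.inputs]
        state[step.output] = ops[step.op](*args)
        trace.append((step.op, step.output, state[step.output]))
    return state, trace


# Stand-in operations: real modules would return masks and images,
# here they return descriptive strings so the flow is visible.
OPS: Dict[str, Callable[..., str]] = {
    "segment": lambda image, roi: f"mask({roi} in {image})",
    "translate": lambda roi_mask, prompt: f"translated({roi_mask} -> {prompt})",
    "compose": lambda scenario, obj: f"compose({scenario}, {obj})",
}

# A program a planner might emit for "turn the dog into a cat" (assumed).
PROGRAM = [
    Step("segment", ("image", "roi_object"), "roi_mask"),
    Step("translate", ("roi_mask", "prompt"), "translated_object"),
    Step("compose", ("scenario", "translated_object"), "result"),
]


def demo() -> Tuple[Dict[str, str], List[Tuple[str, str, str]]]:
    symbols = {
        "image": "input.png",
        "roi_object": "dog",
        "prompt": "cat",
        "scenario": "background",
    }
    return run_program(PROGRAM, OPS, symbols)
```

Because every step writes to a named symbol, a user could in principle overwrite one symbol (for example `prompt`) and re-run only the downstream steps, which is the kind of control the explicit intermediate symbols are meant to enable.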

Abstract

We introduce the Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. DVP embeds a condition-flexible diffusion model within the GPT architecture and orchestrates a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, spanning RoI identification, style transfer, and position manipulation, which makes the image translation process transparent and controllable. Extensive experiments show that DVP outperforms concurrent methods. We attribute this to three key features. First, DVP achieves condition-flexible translation via instance normalization, which removes the sensitivity to manually chosen guidance scales and lets the model focus on the textual description for high-quality content generation. Second, the framework enhances in-context reasoning by decomposing intricate high-dimensional concepts in feature space into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]), allowing localized, context-free editing while maintaining overall coherence. Third, DVP improves systemic controllability and explainability by exposing explicit symbolic representations at each programming stage, so that users can intuitively interpret and modify results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications.
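To make the first point concrete, the sketch below contrasts standard classifier-free guidance, which needs a hand-picked scale `w`, with a scale-free alternative based on instance normalization. The text above does not give the paper's equation, so the second function is an assumption: an AdaIN-style re-standardization in which the conditional noise prediction is normalized and then given the mean and standard deviation of the unconditional one. The paper's actual formulation may differ (for example, it may include extra layers). Function names are hypothetical, and each prediction is treated as one flattened channel.

```python
from statistics import fmean, pstdev
from typing import List


def classifier_free_guidance(
    eps_cond: List[float], eps_uncond: List[float], w: float
) -> List[float]:
    """Standard classifier-free guidance: result depends on the scale w."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def instance_norm_guidance(
    eps_cond: List[float], eps_uncond: List[float], tiny: float = 1e-8
) -> List[float]:
    """Assumed AdaIN-style guidance with no scale hyperparameter.

    The conditional prediction is standardized, then rescaled to the
    statistics of the unconditional prediction.
    """
    mu_c, sd_c = fmean(eps_cond), pstdev(eps_cond)
    mu_u, sd_u = fmean(eps_uncond), pstdev(eps_uncond)
    return [sd_u * (c - mu_c) / (sd_c + tiny) + mu_u for c in eps_cond]
```

In this sketch the output keeps the spatial pattern of the conditional prediction while matching the unconditional prediction's first- and second-order statistics, so no value of `w` has to be tuned per image.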

Paper Structure (24 sections, 5 equations, 20 figures, 4 tables, 3 algorithms)

Introduction
Related Work
Approach
Condition-flexible Diffusion Model
In-context Visual Programming
Experiments
Implementation Details
Comparisons with Current Methods
Systemic Diagnosis
Conclusion
Implementation Details and Pseudo-code of DVP
More qualitative results for instance normalization
More qualitative results for in-context reasoning
Video translation
General-context dataset
...and 9 more sections

Figures (20)

Figure 1: Working pipeline showcase. DVP is a solution rooted in visual programming, demonstrating pronounced capabilities in in-context reasoning and explainable control, in addition to its efficacy in style transfer.
Figure 2: Diffusion Visual Programmer (DVP) overview. Our proposed framework contains two core modules: the first is the condition-flexible diffusion model (see the section "Condition-flexible Diffusion Model"), augmented with instance normalization (see Fig. 3) to achieve a more generalized approach to translation; the second is visual programming (see the section "In-context Visual Programming"), carried out by a series of off-the-shelf operations (e.g., the Segment operation for precise RoI segmentation). The overall neuro-symbolic design enables in-context reasoning for context-free editing. Explicit symbols (e.g., [Prompt], [RoI object], [Scenario], [Translated object]) at each intermediate stage also enhance controllability and explainability, facilitating human interpretation, comprehension, and modification.
Figure 3: Instance Normalization Guidance.
Figure 4: Qualitative results with the state-of-the-art baselines. DVP exhibits rich capability in style transfer, achieving realistic quality while retaining high fidelity. Owing to the context-free manipulation (see the section "In-context Visual Programming"), the DVP framework is capable of flawlessly preserving the background scenes while specifically targeting the translation of the RoI. Note that while VISPROG also enables context-free editing, it exhibits considerable limitations in rational manipulation (see Fig. \ref{fig:position}).
Figure 5: Ablative visualization results of instance normalization compared with various guidance scales $w$.
...and 15 more figures
