In-Context Translation: Towards Unifying Image Recognition, Processing, and Generation

Han Xue; Qianru Sun; Li Song; Wenjun Zhang; Zhiwu Huang

TL;DR

In-Context Translation (ICT) is a general learning framework that unifies visual recognition, low-level image processing, and conditional image generation (e.g., edge-to-image synthesis).

Abstract

We propose In-Context Translation (ICT), a general learning framework to unify visual recognition (e.g., semantic segmentation), low-level image processing (e.g., denoising), and conditional image generation (e.g., edge-to-image synthesis). Thanks to unification, ICT significantly reduces the inherent inductive bias that comes with designing models for specific tasks, and it maximizes mutual enhancement across similar tasks. However, unifying a large number of tasks is non-trivial because their data formats and training pipelines differ. To this end, ICT introduces two designs. First, it standardizes the input-output data of different tasks into RGB image pairs; e.g., a semantic segmentation sample pairs an RGB image with its segmentation mask rendered in the same RGB format. This turns different tasks into a general translation task between two RGB images. Second, it standardizes the training of different tasks into a general in-context learning scheme, where "in-context" means the input comprises an example input-output pair of the target task and a query image. The learning objective is to generate the "missing" data paired with the query. The implicit translation process is thus between the query and the generated image. In experiments, ICT unifies ten vision tasks and achieves strong performance on their respective benchmarks. Notably, ICT performs well across three major categories of computer vision tasks, while its two competitors (Painter and PromptDiffusion) are effective in at most two of these categories. In addition, ICT is trained on only 4 RTX 3090 GPUs and is shown to be more efficient and less costly to train than its competitors.
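The first design, rendering every task's output as an RGB image, can be illustrated for semantic segmentation with a minimal sketch. The palette, the function names, and the nearest-colour decoding step are assumptions made for illustration; the abstract does not specify how ICT assigns colours to classes or maps a generated image back to labels. Nested lists stand in for image arrays so that the sketch has no dependencies.

```python
# Hypothetical palette: class id -> RGB colour. These values are placeholders,
# not the colour assignment used in the paper.
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}


def mask_to_rgb(mask, palette=PALETTE):
    """Turn an H x W grid of class ids into an H x W grid of RGB tuples."""
    return [[palette[class_id] for class_id in row] for row in mask]


def rgb_to_mask(rgb, palette=PALETTE):
    """Map each pixel to the class whose palette colour is nearest.

    Nearest-colour decoding is an assumption: a generated image is rarely an
    exact palette match, so some snapping step is needed to recover labels.
    """
    def nearest(pixel):
        return min(
            palette,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, palette[c])),
        )

    return [[nearest(pixel) for pixel in row] for row in rgb]


mask = [[0, 1], [2, 1]]
rgb = mask_to_rgb(mask)

# A slightly perturbed image, standing in for an imperfect generated output.
noisy = [[(10, 5, 0), (240, 12, 3)], [(7, 250, 9), (255, 0, 0)]]
recovered = rgb_to_mask(noisy)
```

Once the target is an ordinary RGB image, segmentation has the same input-output shape as denoising or edge-to-image synthesis, which is what allows a single image-to-image model to cover all of them.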

Paper Structure

This paper contains 16 sections, 2 equations, 7 figures, and 10 tables.

Introduction
Related Works
Unified Vision Models
Diffusion Models
In-Context Learning
In-Context Translation (ICT)
Context-based Data Construction
Training Framework
Experiments
Experimental Settings
Results and Analyses
Ablation Study
Generalization Capability
Few-shot In-context Inference
Efficiency and Scalability
...and 1 more section

Figures (7)

Figure 1: ICT shows impressive results in unifying three distinct categories of vision tasks within a single framework. Each task is implicitly instructed by a random input-output pair (with an optional text prompt). Compared to its competitors, Painter [wang2023images] and PromptDiffusion [wang2023context], ICT unifies more categories of tasks. Note that the competitors take different inputs; see the experiments for more detailed comparisons.
Figure 2: An overview of the proposed framework. It uses a pre-trained Stable Diffusion (SD) model to perform in-context image translation. The ground truth is a grid image in which each row is an input-output pair from the same task. The first row, composed of $E_{in}$ and $E_{out}$, serves as the image context, and the model is trained to predict $I_{gt}$, the output paired with the query image $I_{query}$. At inference time, we crop out the lower-right region of the infilled output as the final result $I_{out}$.
Figure 3: Visual comparison results on visual recognition. "GT" represents ground truth. (a) PromptDiffusion [wang2023context] tends to mispredict pixels as "black". Red arrows indicate the mispredictions produced by Painter [wang2023images]. (b) The competing methods, especially PromptDiffusion [wang2023context], fail to consistently produce accurate depth predictions (red arrows). (c) PromptDiffusion [wang2023context] tends to overpredict keypoints (red arrows). In contrast, the proposed ICT consistently produces accurate predictions.
Figure 4: Visual comparison results on low-level image processing. "GT" represents ground truth. PromptDiffusion [wang2023context] tends to generate distorted results or images with color shifts that do not adhere to the query image.
Figure 5: Visual comparison results on conditional image generation. "GT" represents ground truth. Painter [wang2023images] fails to generate realistic images from conditions with sparse semantics.
...and 2 more figures
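
The grid layout described in the Figure 2 caption can be sketched as follows. This is an illustration of the data layout only, not the authors' implementation: the helper names are hypothetical, the blank fill value used to hide the lower-right cell is an assumption, and nested lists stand in for image arrays so that the sketch has no dependencies.

```python
def solid(height, width, value):
    """Create a height x width image filled with one RGB value."""
    return [[value] * width for _ in range(height)]


def hconcat(left, right):
    """Place two images of equal height side by side."""
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def build_grid(e_in, e_out, i_query, i_lower_right):
    """Stack the context row (E_in, E_out) above the query row."""
    return hconcat(e_in, e_out) + hconcat(i_query, i_lower_right)


def crop_lower_right(grid, height, width):
    """Extract the lower-right cell, which holds the result I_out."""
    return [row[width:] for row in grid[height:]]


H, W = 2, 2
e_in = solid(H, W, (1, 1, 1))      # example input
e_out = solid(H, W, (2, 2, 2))     # example output
i_query = solid(H, W, (3, 3, 3))   # query image
i_gt = solid(H, W, (4, 4, 4))      # target paired with the query

# Training target: the full grid, with I_gt in the lower-right cell.
gt_grid = build_grid(e_in, e_out, i_query, i_gt)

# Model input: the same grid with the lower-right cell hidden. Filling it
# with zeros is an assumption; the caption only says the region is infilled.
blank = solid(H, W, (0, 0, 0))
input_grid = build_grid(e_in, e_out, i_query, blank)

# At inference, the lower-right cell of the infilled grid is the result.
i_out = crop_lower_right(gt_grid, H, W)
```

Because the context pair and the query share one canvas, changing the task only requires swapping the top row; the model and the cropping step stay the same.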
