Colorization Transformer

Manoj Kumar, Dirk Weissenborn, Nal Kalchbrenner

TL;DR

ColTran addresses the inherently stochastic problem of high-resolution image colorization by decomposing it into a coarse low-resolution autoregressive colorizer and two fast parallel upsampling networks. It introduces conditional transformer layers within an Axial Transformer framework to condition colorization on the grayscale input, enabling global 2D context with $O(D\sqrt{D})$ complexity. The model achieves state-of-the-art FID on ImageNet (≈19.37) and strong results in human evaluation, while keeping sampling fast through semi-parallel generation and parallel upsampling. These results demonstrate the viability of fully attention-based colorization for high-resolution images and highlight the value of auxiliary predictions and conditioning components for improved fidelity and diversity.
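To make the $O(D\sqrt{D})$ figure concrete, the following minimal sketch counts query-key pairs for full self-attention versus one row-attention pass plus one column-attention pass. It assumes $D$ denotes the number of pixels of a square image and ignores channels, heads, and constant factors; the function names are illustrative and are not taken from the ColTran code.

```python
import math


def full_attention_cost(height: int, width: int) -> int:
    """Query-key pairs when every pixel attends to every other pixel: D^2."""
    d = height * width
    return d * d


def axial_attention_cost(height: int, width: int) -> int:
    """Query-key pairs for one row pass plus one column pass.

    In the row pass each pixel attends to the `width` pixels of its row,
    in the column pass to the `height` pixels of its column.
    """
    d = height * width
    return d * (width + height)


if __name__ == "__main__":
    for side in (64, 256):  # illustrative square resolutions
        d = side * side
        full = full_attention_cost(side, side)
        axial = axial_attention_cost(side, side)
        # For a square image the axial count equals 2 * D * sqrt(D).
        assert axial == 2 * d * math.isqrt(d)
        print(f"{side}x{side}: full={full:,}  axial={axial:,}")
```

Under these assumptions the saving grows with resolution, since the ratio of the two counts is $\sqrt{D}/2$.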

Abstract

We present the Colorization Transformer, a novel approach for diverse high fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to condition effectively on the grayscale input. Two subsequent fully parallel networks upsample the coarse colored low resolution image into a finely colored high resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorizing ImageNet, based on FID results and on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth. The code and pre-trained checkpoints for Colorization Transformer are publicly available at https://github.com/google-research/google-research/tree/master/coltran

Paper Structure

This paper contains 41 sections, 10 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Samples of our model showing diverse, high-fidelity colorizations.
  • Figure 2: Depiction of ColTran. It consists of 3 individual models: an autoregressive colorizer (left), a color upsampler (middle) and a spatial upsampler (right). Each model is optimized independently. The autoregressive colorizer (ColTran core) is an instantiation of Axial Transformer (Sec. \ref{autorec_base}, \cite{ho2019axial}) with conditional transformer layers and an auxiliary parallel head proposed in this work (Sec. \ref{cat}). During training, the ground-truth coarse low resolution image is both the input to the decoder and the target. Masked layers ensure that the conditional distribution for each pixel depends solely on previous ground-truth pixels (see Appendix \ref{sec:autoregressive models} for a recap on autoregressive models). ColTran upsamplers are stacked row/column attention layers that deterministically upsample color and space in parallel. Each attention block (in green) is residual and consists of the following operations: layer-norm $\rightarrow$ multihead self-attention $\rightarrow$ MLP.
  • Figure 3: Per pixel log-likelihood of coarse colored $64 \times 64$ images over the validation set as a function of training steps. We ablate the various components of the ColTran core in each plot. Left: ColTran with Conditional Transformer Layers vs. a baseline Axial Transformer which conditions via addition (ColTran-B). ColTran-B 2x and ColTran-B 4x refer to wider baselines with increased model capacity. Center: Removing each conditional sub-component one at a time (no cLN, no cMLP and no cAtt). Right: Conditional shifts only (Shift), Conditional scales only (Scale), removal of kq conditioning in cAtt (cAtt, only v) and fixed mean pooling in cLN (cLN, mean pool). See Section \ref{ablations} for more details.
  • Figure 4: Left: FID of generated $64 \times 64$ coarse samples as a function of training steps for $\lambda=0.01$ and $\lambda=0.0$. Center: Final FID scores as a function of $\lambda$. Right: FID as a function of log-likelihood.
  • Figure 5: We display the per-pixel, maximum predicted probability over 512 colors as a proxy for uncertainty.
  • ...and 12 more figures
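
The uncertainty proxy described in the Figure 5 caption (the per-pixel maximum predicted probability over 512 colors) can be sketched as follows. This is a minimal illustration, assuming the model exposes per-pixel logits over the 512 color classes; the `[height][width][512]` nested-list layout and the function name are hypothetical, and the inputs in the usage example are made up rather than model outputs.

```python
import math


def max_probability_map(logits):
    """Return an [H][W] map of the largest softmax probability at each pixel.

    `logits` is a nested list of shape [H][W][C], where C is the number of
    color classes (512 in the Figure 5 caption). Values close to 1/C mean the
    prediction is spread evenly over colors (high uncertainty); values close
    to 1 mean a single color dominates.
    """
    result = []
    for row in logits:
        out_row = []
        for pixel in row:
            peak = max(pixel)
            # Subtract the peak before exponentiating for numerical stability.
            exps = [math.exp(v - peak) for v in pixel]
            out_row.append(max(exps) / sum(exps))
        result.append(out_row)
    return result


if __name__ == "__main__":
    num_colors = 512
    uncertain_pixel = [0.0] * num_colors                 # all colors equally likely
    confident_pixel = [20.0] + [0.0] * (num_colors - 1)  # one dominant color
    prob_map = max_probability_map([[uncertain_pixel, confident_pixel]])
    print(prob_map)
```

A uniform prediction gives the lowest possible value, 1/512, so the map can be read as a confidence score ranging from 1/512 to 1.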