Language-based Image Colorization: A Benchmark and Beyond

Yifan Li, Shuai Yang, Jiaying Liu

TL;DR

This work surveys and benchmarks language-based image colorization, emphasizing cross-modality alignment and the variety of conditioning strategies. It introduces Color-Turbo, a distilled one-step diffusion-based colorization method that ingests grayscale input and user prompts, with a tunable colorfulness parameter and a loss function combining $\mathcal{L} = \lambda_{pixel} \mathcal{L}_{pixel} + \lambda_{lpips} \mathcal{L}_{lpips} + \lambda_{clip} \mathcal{L}_{clip} + \lambda_{adv} \mathcal{L}_{adv}$, achieving substantial speedups and improved stability over prior diffusion-based approaches. The authors propose HI-FID to better align evaluation with human perception for colorization, and validate Color-Turbo across automatic and language-based tasks on multiple datasets, supplemented by a user study. Their analyses cover cross-modality design choices, conditioning paradigms, and stability considerations, providing a practical baseline and actionable insights for future research in controllable, efficient colorization. The work thus offers a comprehensive perspective for researchers and a ready-to-use framework for practitioners seeking fast, text-guided colorization with reliable color fidelity.
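
The weighted objective in the TL;DR can be illustrated with a minimal Python sketch. It only shows how four already-computed loss terms would be combined into one scalar. The names `LossWeights` and `total_loss` are hypothetical, and the default weights are placeholders, since the summary above does not state the values used for Color-Turbo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Placeholder weights for illustration only; not taken from the paper.
    pixel: float = 1.0
    lpips: float = 1.0
    clip: float = 1.0
    adv: float = 1.0


def total_loss(l_pixel, l_lpips, l_clip, l_adv, w=LossWeights()):
    """Weighted sum of the pixel, LPIPS, CLIP, and adversarial loss terms.

    Each argument is assumed to be a scalar loss value that has already
    been computed elsewhere; this function only performs the combination.
    """
    return (
        w.pixel * l_pixel
        + w.lpips * l_lpips
        + w.clip * l_clip
        + w.adv * l_adv
    )
```

Setting a weight to zero removes the corresponding term, which is one simple way to ablate, for example, the adversarial component in such a sketch.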

Abstract

Image colorization aims to restore color to grayscale images. Automatic image colorization methods, which require no additional guidance, struggle to generate high-quality images due to color ambiguity and provide limited user controllability. Thanks to the emergence of cross-modality datasets and models, language-based colorization methods have been proposed to fully exploit the efficiency and flexibility of text descriptions for guiding colorization. Given the lack of a comprehensive review of the language-based colorization literature, we conduct a thorough analysis and benchmark. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and identify cross-modal alignment as their core challenge. We further divide these methods into two categories: one trains a cross-modality network from scratch, while the other uses a pre-trained cross-modality model to establish the textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on a distilled diffusion model. Extensive experiments demonstrate that our simple baseline produces better results than previous complex methods with a 14× speedup. To the best of our knowledge, this is the first comprehensive review and benchmark of the language-based image colorization field, providing meaningful insights for the community. The code is available at https://github.com/lyf1212/Color-Turbo.

Paper Structure

This paper contains 32 sections, 3 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Five typical color distortions of automatic colorization methods. From (a) to (e), the images are generated by HistoryNet, CT2, BigColor, CIC, and DDColor.
  • Figure 2: Comparison of automatic and language-based methods. Automatic colorization methods (BigColor, ColorFormer, DDColor) tend to produce overflowed, under-saturated, and incomplete results. The language-based method COCO-LC not only generates high-quality results from deterministic color cues, but also produces diverse results according to different user commands.
  • Figure 3: Taxonomy of automatic and language-based methods.
  • Figure 4: Illustration of the main features of existing language-based methods. Since L-CoDe and L-CoDer decouple color-object words, color overflow occurs on objects that have no corresponding color words. L-CoIns weakens the correlation between brightness and color through data augmentation, damaging the original semantics, as carrots are turned into yams. UniColor cannot achieve finer colorization, as the broccoli turns red. Diffusion-based L-CAD and COCO-LC produce more plausible results. COCO-LC generates a more colorful image than L-CAD, but with some color overflow on the bowls.
  • Figure 5: We summarize four representative condition insertion paradigms: (1) mid-layer grayscale feature insertion, (2) a ControlNet-based bypass encoder, (3) grayscale initial noise, and (4) grayscale images as input. These four methods constrain the generation process with grayscale images at different injection intensities and feature granularities.
  • ...and 9 more figures