Enabling Region-Specific Control via Lassos in Point-Based Colorization

Sanghyeon Lee, Jooyeol Yun, Jaegul Choo

TL;DR

This work tackles color collapse in point-based interactive colorization by introducing a lasso tool that bounds color propagation and a localization attention mask to gate cross-attention within user-defined regions. The method uses a Transformer-based pipeline where grayscale queries attend to color hints through a masked cross-attention map $M_l$, ensuring colors spread only within specified lassos and their vicinity. Key contributions include simulating user hints during training, a detailed hint-encoder and localized-attention architecture, and an objective based on the Huber loss in CIE $L^*a^*b^*$ space. Empirical results show the lasso-enabled approach reduces the number of interactions and time to reach target quality, mitigates color collapse on challenging datasets, and maintains competitive PSNR/LPIPS against point-only baselines, with practical benefits for interactive color editing.

Abstract

Point-based interactive colorization techniques allow users to effortlessly colorize grayscale images using user-provided color hints. However, point-based methods often face challenges when different colors are given to semantically similar areas, leading to color intermingling and unsatisfactory results, an issue we refer to as color collapse. The fundamental cause of color collapse is the inadequacy of points for defining the boundaries for each color. To mitigate color collapse, we introduce a lasso tool that can control the scope of each color hint. Additionally, we design a framework that leverages the user-provided lassos to localize the attention masks. The experimental results show that using a single lasso is as effective as applying 4.18 individual color hints and can achieve the desired outcomes in 30% less time than using points alone.

Paper Structure

This paper contains 20 sections, 2 equations, 20 figures, 2 tables.

Figures (20)

  • Figure 1: Examples of color collapse. The star mark and the corresponding lasso in the same color describe the region designated for each color hint by the user. By specifying regions, users can better control how colors spread, thereby mitigating color collapse and leading to a more intentional colorization process.
  • Figure 2: Overview of our framework. For training, our framework acquires color hints and corresponding lassos through a user interaction simulation process. The grayscale image serves as the query and the color hints serve as keys and values, producing the cross-attention map $QK^T$. The attention map is then modulated by an attention mask derived from the lassos to precisely control the influence of each color hint on the query image tokens.
  • Figure 3: Localization Attention Mask. For each color hint, we apply a mask with a value of 1 to the tokens corresponding to patches inside the lasso areas. Simultaneously, we construct an unconditional mask, $M_u$, based on regions not overlapped by lassos. The final localization attention mask, $M_l$, is produced by concatenating $M_u$ with the per-hint masks $M_C$.
  • Figure 4: Qualitative results compared with baselines. Each star and its matching-colored lasso highlight the user-selected region for that color. The results shown for our method reflect the colorization achieved through user-directed application of both lassos and points, with pre-defined lassos.
  • Figure 5: User study results on color collapse easy samples. We measure the average PSNR over the user interaction time, with the initial PSNR derived from each model’s unconditional inference.
  • ...and 15 more figures
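
The captions of Figures 2 and 3 describe how the localization attention mask gates cross-attention. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. Several details are assumptions made for illustration: the names `localization_mask` and `masked_cross_attention` are hypothetical, lassos are given as sets of token indices, the $M_u$ column is paired with a single "unconditional" key placed first, masked entries are excluded from the softmax, and scores use the standard $1/\sqrt{d}$ scaling. The "vicinity" expansion of lassos mentioned in the TL;DR is not modeled.

```python
import math

def localization_mask(num_tokens, lassos):
    """Build M_l = [M_u | M_C] as a list of rows (one row per query token).

    lassos: list of sets of token indices, one set per color hint (assumed format).
    Column 0 is M_u (1 where no lasso covers the token);
    columns 1.. are M_C (1 where the token lies inside that hint's lasso).
    """
    rows = []
    for i in range(num_tokens):
        m_c = [1 if i in lasso else 0 for lasso in lassos]
        m_u = 0 if any(m_c) else 1
        rows.append([m_u] + m_c)
    return rows

def masked_cross_attention(queries, keys, values, mask):
    """Cross-attention where each query only attends to keys its mask row allows.

    queries: N x d, keys: K x d, values: K x dv, mask: N x K with 0/1 entries.
    Every mask row built by localization_mask has at least one 1,
    so the softmax denominator is never zero.
    """
    d = len(queries[0])
    outputs = []
    for q, m_row in zip(queries, mask):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(s for s, m in zip(scores, m_row) if m)  # for numerical stability
        weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, m_row)]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy example: 4 grayscale tokens, 2 color hints.
# Hint 0's lasso covers tokens 0 and 1; hint 1's lasso covers token 2.
mask = localization_mask(4, [{0, 1}, {2}])

queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # unconditional key, hint 0, hint 1
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy color values per key
colors = masked_cross_attention(queries, keys, values, mask)
```

In this toy setup the lassos do not overlap, so each token attends to exactly one key: tokens 0 and 1 receive hint 0's value, token 2 receives hint 1's value, and token 3 falls back to the unconditional key. With overlapping lassos, a token would mix the values of all hints whose lassos contain it, weighted by the softmax over the allowed scores.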