NumColor: Precise Numeric Color Control in Text-to-Image Generation

Muhammad Atif Butt, Diego Hernandez, Alexandra Gomez-Villa, Kai Wang, Javier Vazquez-Corral, Joost Van De Weijer

Abstract

Text-to-image diffusion models excel at generating images from natural language descriptions, yet fail to interpret numerical colors such as hex codes (#FF5733) and RGB values (rgb(255,87,51)). This limitation stems from subword tokenization, which fragments color codes into semantically meaningless tokens that text encoders cannot map to coherent color representations. We present NumColor, a method that enables precise numerical color control across multiple diffusion architectures. NumColor comprises two components: a Color Token Aggregator that detects color specifications regardless of tokenization, and a ColorBook of 6,707 learnable embeddings, anchored in the perceptually uniform CIE Lab space, that map colors into the embedding space of the text encoder. We introduce two auxiliary losses, directional alignment and interpolation consistency, to enforce geometric correspondence between the Lab and embedding spaces, enabling smooth color interpolation. To train the ColorBook, we construct NumColor-Data, a synthetic dataset of 500K rendered images with unambiguous color-to-pixel correspondence, eliminating the annotation ambiguity inherent in photographic datasets. Although trained solely on FLUX, NumColor transfers zero-shot to SD3, SD3.5, PixArt-α, and PixArt-Σ without model-specific adaptation. NumColor improves numerical color accuracy by 4-9x across five models, while simultaneously improving color harmony scores by 10-30x on the GenColorBench benchmark.
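
To make the two numeric notations and the target color space concrete, the following is a minimal sketch that reads a hex code or an rgb(...) value from a prompt and converts it to CIE Lab using the standard sRGB (D65) formulas. The function names `parse_color` and `srgb_to_lab` are hypothetical, and the regular expressions are a simplification chosen for illustration: the paper's Color Token Aggregator is described as a learned character-level labeler, not a regex.

```python
import re

# Illustrative patterns only (an assumption, not the paper's aggregator).
HEX_RE = re.compile(r"#([0-9a-fA-F]{6})\b")
RGB_RE = re.compile(r"rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)")


def parse_color(text):
    """Return an (r, g, b) triple for a hex code or rgb(...) value in text, or None."""
    m = HEX_RE.search(text)
    if m:
        h = m.group(1)
        return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))
    m = RGB_RE.search(text)
    if m:
        rgb = tuple(int(g) for g in m.groups())
        if all(0 <= c <= 255 for c in rgb):
            return rgb
    return None


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIE Lab (D65 reference white)."""
    def to_linear(c):
        # Undo the sRGB transfer curve.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)

    # Linear RGB -> CIE XYZ using the sRGB primaries.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Piecewise cube root from the CIE Lab definition.
        d = 6 / 29
        return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

    # Normalize by the D65 white point before applying f.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))
```

With this sketch, `parse_color("#FF5733")` and `parse_color("rgb(255,87,51)")` return the same triple, so both notations from the abstract resolve to a single Lab point.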

Paper Structure (39 sections, 8 equations, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Lack of Numeric Color Understanding. Left: Text encoders fragment hex codes into arbitrary subwords that attend to unrelated image regions, leading to incorrect color generation. Right: CKA (Kornblith et al., 2019) and neighborhood consistency (k=8) between text embeddings and Lab space show that color names preserve perceptual structure, while hex codes and RGB values degrade by $\sim60\%$. NumColor recovers structure comparable to that of color names.
  • Figure 2: Method overview. (a) Baseline: CLIP and T5 encode text; DiT generates images via iterative denoising. (b) Color Token Aggregator: A character-level sequence labeler unifies fragmented color tokens using a linear classifier, trained with cross-entropy on numeric color text prompts. (c) ColorBook Training: Unified color tokens receive learned embeddings before T5 contextualization. The ColorBook (6,707 Lab anchors) is trained with a flow matching loss on NumColor-Data, with directional and interpolation losses to preserve Lab geometry. Only ColorBook embeddings receive gradients; CLIP, T5, and DiT remain frozen.
  • Figure 3: NumColor-Data generation pipeline. We render 3D meshes from Objaverse-XL in diverse indoor and outdoor scenes. Each object is assigned a uniform Lambertian material with albedo set to the target Lab color. We use object-centered camera estimation and multi-orientation lighting to ensure diverse viewpoints. The dataset comprises 500K images spanning varied objects, scenes, and colors from ColorBook anchors. Caption generation: rendered images are paired with descriptive captions, validated by human experts, and converted to template prompts.
  • Figure 4: Qualitative results on FLUX with NumColor. We evaluate coarse colors spanning primary hues from deep navy to saturated green; fine-grained colors, including perceptually adjacent grayscale, blue-to-cyan, and red-to-yellow colors; and interpolated colors synthesized by interpolating between ColorBook anchors. NumColor maintains accurate color reproduction across objects.
  • Figure 5: Cross-model generalization. Baseline models versus NumColor integration across SD3, SD3.5, PixArt-$\alpha$, and PixArt-$\Sigma$. NumColor, trained on FLUX, transfers zero-shot to other diffusion models. Baselines exhibit incorrect hue mapping, ignored specifications, and rainbow artifacts; NumColor resolves these while preserving image quality.
  • ...and 10 more figures
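
The Figure 1 caption reports neighborhood consistency (k=8) between text embeddings and Lab space. The exact definition is not given in this excerpt, so the sketch below assumes one common reading: for each color, take the fraction of its k nearest neighbors in embedding space that are also among its k nearest neighbors in Lab space, then average over all colors. The names `knn` and `neighborhood_consistency` are hypothetical.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean), excluding i."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return {j for _, j in sorted(dists)[:k]}


def neighborhood_consistency(emb, lab, k=8):
    """Mean k-NN overlap between two point sets that share the same item order.

    Assumed definition (not taken from the paper): the result is 1.0 when every
    item has identical k-NN sets in both spaces and 0.0 when no neighbours match.
    """
    n = len(emb)
    if n != len(lab):
        raise ValueError("emb and lab must describe the same items")
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of items")
    overlap = sum(len(knn(emb, i, k) & knn(lab, i, k)) for i in range(n))
    return overlap / (n * k)
```

Under this definition, an embedding that is a uniformly scaled copy of the Lab coordinates scores 1.0, since scaling does not change which points are nearest; lower scores indicate that the embedding has lost some of the local perceptual structure.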