AceTone: Bridging Words and Colors for Conditional Image Grading

Tianren Ma, Mingxiang Liao, Xijin Zhang, Qixiang Ye

Abstract

Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, and they struggle to generalize across creative intents or to align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal-conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, in which a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE-based tokenizer that compresses a $3\times32^3$ LUT vector into 64 discrete tokens with $\Delta E<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
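The abstract frames grading as predicting a 3D-LUT that is then applied to the input image. As context for readers unfamiliar with the mechanism, the sketch below shows the standard way a $32^3$ RGB LUT is applied via trilinear interpolation; this is an illustrative NumPy implementation of the generic technique, not AceTone's actual code, and the function name `apply_3d_lut` is our own.

```python
import numpy as np

def apply_3d_lut(image, lut):
    """Apply a 3D LUT (N x N x N x 3, values in [0, 1]) to an RGB image
    (H x W x 3, values in [0, 1]) via trilinear interpolation."""
    n = lut.shape[0]
    # Map pixel values onto LUT grid coordinates.
    coords = image * (n - 1)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = coords - lo

    out = np.zeros_like(image)
    # Blend the 8 corners of the surrounding LUT cell.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                r = hi[..., 0] if dr else lo[..., 0]
                g = hi[..., 1] if dg else lo[..., 1]
                b = hi[..., 2] if db else lo[..., 2]
                w = ((frac[..., 0] if dr else 1 - frac[..., 0])
                     * (frac[..., 1] if dg else 1 - frac[..., 1])
                     * (frac[..., 2] if db else 1 - frac[..., 2]))
                out += w[..., None] * lut[r, g, b]
    return out

# An identity LUT maps each grid node to its own normalized coordinate,
# so applying it should leave the image unchanged.
n = 32
axis = np.linspace(0.0, 1.0, n)
identity = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

img = np.random.rand(4, 4, 3)
graded = apply_3d_lut(img, identity)
```

Because the LUT is a small, fixed-size tensor ($3\times32^3$ values here), it is a natural target for the discrete tokenization the paper describes.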

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: AceTone performs conditional color grading under two paradigms: reference-based (top) and instruction-based (bottom). It accurately captures subtle tonal characteristics, follows user intents, and produces visually coherent color adjustments.
  • Figure 2: Overview of the AceTone framework. AceTone integrates a vector-quantized tokenizer to represent grading operations. Its phased training includes: (1) vision–language pretraining to learn tone prediction from image pairs, and (2) a post-training stage with supervised fine-tuning (SFT) and reinforcement learning (RL) (simplified in this figure). During RL training, multiple tone hypotheses are generated and optimized via group relative policy optimization (GRPO), enabling faithful and controllable tone manipulation aligned with user intent.
  • Figure 3: Visualization of the tokenizer's compression quality. We apply the original and reconstructed LUT to the image, and report the corresponding $\Delta \text{E}$ metric.
  • Figure 4: A Qualitative Visualization of AceTone. Top: Style transfer comparison. AceTone effectively captures the target color style while avoiding color banding or unnatural hue shifts. Bottom: Instruction-guided grading comparison (from AceTone-Bench $[\text{Instruct}]$). AceTone faithfully follows user intent and produces visually appealing adjustments. Best viewed in color and zoomed in for full details.
  • Figure 5: Performance Curve during Training.
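Figure 3 and the abstract report reconstruction fidelity as a $\Delta \text{E}$ metric. The paper does not specify which $\Delta \text{E}$ formula it uses; as a point of reference, the simplest variant, CIE76, is just the Euclidean distance between two colors in CIELAB space, sketched below (the helper name `delta_e76` is ours).

```python
import numpy as np

def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between CIELAB colors.

    Inputs are (..., 3) arrays of (L*, a*, b*) values; returns per-color
    distances. Values below ~2 are often treated as barely perceptible.
    """
    lab1 = np.asarray(lab1, dtype=float)
    lab2 = np.asarray(lab2, dtype=float)
    return np.linalg.norm(lab1 - lab2, axis=-1)

# Two nearby Lab colors: a small shift in lightness and chroma.
lab_a = np.array([55.0, 10.0, -20.0])
lab_b = np.array([55.5, 10.8, -19.0])
diff = delta_e76(lab_a, lab_b)
```

Under this reading, the tokenizer's $\Delta E<2$ claim means the reconstructed LUT shifts colors by less than a just-noticeable difference on average; later revisions of the formula (CIE94, CIEDE2000) weight the Lab axes perceptually rather than uniformly.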