NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing

Zhenyu Xu, Xiaoqi Shen, Haotian Nan, Xinyu Zhang

TL;DR

NumeriKontrol addresses the lack of precise numeric control in instruction-based image editing by introducing a plug-and-play Numeric Adapter for Diffusion Transformers. The approach encodes numeric strengths with unit-aware representations and injects them into MM-DiT through an in-context learning framework, enabling zero-shot multi-numeric editing. A dedicated CAT dataset provides ground-truth attribute scales from physically plausible sources, guiding accurate, continuous edits. Empirical results show NumeriKontrol achieves superior precision, consistency, and robustness across low- and high-level editing tasks, outperforming state-of-the-art baselines and validating the utility of unit-separated numeric conditioning for practical image editing workflows.

Abstract

Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.
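As a rough illustration of what "continuous scalar values with common units" could look like at the interface level, the sketch below pulls signed numbers and their units out of an instruction string. The regular expression, the unit vocabulary, and the names `NumericCondition` and `parse_numeric_conditions` are assumptions made for illustration only; this summary does not specify how NumeriKontrol actually parses instructions.

```python
import re
from typing import List, NamedTuple


class NumericCondition(NamedTuple):
    """One numeric edit strength together with its unit (hypothetical type)."""
    value: float
    unit: str


# Hypothetical unit vocabulary; longer spellings come first so that
# "degrees" is preferred over "degree" and "deg" in the alternation.
_UNITS = ("degrees", "degree", "deg", "stops", "stop", "%", "K", "x")

# A signed decimal number, optional whitespace, then a known unit that is
# not immediately followed by another letter.
_PATTERN = re.compile(
    r"([+-]?\d+(?:\.\d+)?)\s*("
    + "|".join(re.escape(u) for u in _UNITS)
    + r")(?![A-Za-z])"
)


def parse_numeric_conditions(instruction: str) -> List[NumericCondition]:
    """Return every (value, unit) pair found in the instruction, in order."""
    return [
        NumericCondition(float(value), unit)
        for value, unit in _PATTERN.findall(instruction)
    ]


if __name__ == "__main__":
    text = "rotate the camera by -15 degrees and raise exposure by 1.5 stops"
    for cond in parse_numeric_conditions(text):
        print(cond)
```

Keeping the sign inside the parsed value, rather than in the wording of the instruction, is one simple way to let a single instruction template cover both directions of an edit.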

Paper Structure

This paper contains 24 sections, 6 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: NumeriKontrol generates precise editing trajectories across diverse attributes conditioned on a source image, editing instruction, and numeric strength. Trained on our synthesized dataset, the model enables precise control over low-level and high-level editing tasks.
  • Figure 2: a) Overall pipeline of NumeriKontrol. Numeric information is extracted from the editing instruction, encoded, and then fused with the task ID embedding. Positive and negative numeric editing operations share the same task ID. We construct a comprehensive operation dataset comprising both synthetically generated and real-captured images. b) Dataset generation. Dataset diversity is ensured through heterogeneous data sources. During training, origin and context images are randomly sampled from the attribute animation sequences.
  • Figure 3: a) Example of resolving a coupled instruction. Training directly on such instructions fails because the real starting point of the edit is the image with the light already cast, not the source image. b) Visualization of decoupled and order-independent editing through multiple numeric instructions.
  • Figure 4: Illustration of the difference between physically synthesized data and morphing-generated intermediates. Interpolation-based morphing methods fail to preserve visual fidelity even in straightforward low-level editing scenarios such as ISO modification. FreeMorph, a representative diffusion-based morphing technique, samples trajectories that deviate substantially from the ground-truth sequences.
  • Figure 5: Qualitative comparison of NumeriKontrol and baseline methods across various scenarios, including outdoor scenes, portraits, and object editing. For NumeriKontrol, the "delta" in each caption is replaced by the actual number shown above the images; captions for the other methods are tuned for each method. The man's smile generated by NumeriKontrol is less prominent than in the other methods because the model learned from a dataset containing only subtle smile examples.
  • ...and 5 more figures
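
The Figure 2 caption describes encoding the numeric value and fusing it with a task ID embedding, with positive and negative edits sharing one task ID. The following is a minimal sketch of that idea, not the paper's implementation: the sinusoidal encoding, the fixed stand-in for a learned embedding table, fusion by element-wise addition, the dimension `DIM`, and the task names are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List

DIM = 8  # hypothetical embedding width

# Hypothetical task table. "Brighten" and "darken" would both map to
# "brightness": the direction of the edit lives in the sign of the number.
TASK_IDS: Dict[str, int] = {"brightness": 0, "rotation": 1}


def task_embedding(task_id: int, dim: int = DIM) -> List[float]:
    """Stand-in for a learned embedding table: one fixed vector per task ID."""
    return [math.cos((task_id + 1) * (i + 1)) for i in range(dim)]


def numeric_encoding(value: float, dim: int = DIM) -> List[float]:
    """Sinusoidal features of a scalar at geometrically spaced frequencies."""
    half = dim // 2
    freqs = [1.0 / (100.0 ** (i / half)) for i in range(half)]
    sines = [math.sin(value * f) for f in freqs]
    cosines = [math.cos(value * f) for f in freqs]
    return sines + cosines


def fuse(task: str, value: float) -> List[float]:
    """Fuse the task embedding and numeric encoding by element-wise addition."""
    t = task_embedding(TASK_IDS[task])
    n = numeric_encoding(value)
    return [a + b for a, b in zip(t, n)]


if __name__ == "__main__":
    up = fuse("brightness", 1.0)
    down = fuse("brightness", -1.0)
    # Same task ID, opposite sign: only the sine half of the vector differs.
    print([round(u - d, 3) for u, d in zip(up, down)])
```

In this toy version, opposite-signed edits of the same attribute differ only in the sine components, which is one way a shared task ID can still yield distinct conditioning for positive and negative strengths.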