NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing
Zhenyu Xu, Xiaoqi Shen, Haotian Nan, Xinyu Zhang
TL;DR
NumeriKontrol addresses the lack of precise numeric control in instruction-based image editing with a plug-and-play Numeric Adapter for Diffusion Transformers. The adapter encodes numeric edit strengths with unit-aware representations and injects them into MM-DiT through an in-context learning framework, enabling zero-shot multi-numeric editing. The Common Attribute Transform (CAT) dataset supplies ground-truth attribute scales from high-fidelity rendering engines and DSLR cameras, supervising accurate, continuous edits. Experiments show that NumeriKontrol outperforms state-of-the-art baselines in precision, consistency, and robustness across low- and high-level editing tasks, supporting unit-separated numeric conditioning as a practical basis for image editing workflows.
Abstract
Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.
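To make the idea of unit-aware, task-separated numeric conditioning concrete, the following is a minimal sketch and not the Numeric Adapter itself. It assumes a small hypothetical attribute registry (the names, units, and ranges in ATTRIBUTES are illustrative and do not come from the paper), normalizes each value by its attribute's range, and encodes it with a one-hot attribute slot plus sinusoidal features. Keying the tokens by attribute rather than by position is one simple way to make the encoding independent of the order in which conditions are given. The learned projection and the injection into the diffusion model are omitted.

```python
import math

# Hypothetical registry: attribute name -> (unit, min, max).
# These entries are illustrative assumptions, not values from the paper.
ATTRIBUTES = {
    "exposure": ("EV", -3.0, 3.0),
    "color_temperature": ("K", 2000.0, 10000.0),
    "rotation": ("deg", -180.0, 180.0),
}


def sinusoidal(x, dim=8):
    """Encode a scalar in [0, 1] as sin/cos pairs at doubling frequencies."""
    feats = []
    for i in range(dim // 2):
        freq = math.pi * (2.0 ** i)
        feats.append(math.sin(freq * x))
        feats.append(math.cos(freq * x))
    return feats


def encode_condition(name, value, dim=8):
    """Build one condition token: one-hot attribute slot + value features."""
    unit, lo, hi = ATTRIBUTES[name]
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value} {unit} is outside [{lo}, {hi}]")
    # Normalizing by the attribute's own range keeps different units comparable.
    normalized = (value - lo) / (hi - lo)
    one_hot = [1.0 if n == name else 0.0 for n in sorted(ATTRIBUTES)]
    return one_hot + sinusoidal(normalized, dim)


def encode_conditions(conditions, dim=8):
    """Encode (name, value) pairs as tokens keyed by attribute, not by order."""
    return {name: encode_condition(name, value, dim) for name, value in conditions}
```

For example, encode_conditions([("exposure", 1.5), ("rotation", 90.0)]) and the same call with the two pairs swapped return equal dictionaries, which mirrors the "any order" property described in the abstract at the level of this toy encoding only.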
