MIFO: Learning and Synthesizing Multi-Instance from One Image
Kailun Su, Ziqi He, Xi Wang, Yang Zhou
TL;DR
The paper tackles learning and reconstructing multiple semantically similar object instances from a single image. It introduces a two-stage framework: the learning stage disentangles instance semantics through reward- and penalty-based attention optimization, and the synthesis stage applies in-box and out-of-box control to the self-attention (SA) and cross-attention (CA) layers, guided by a dynamic weight schedule, to place each instance precisely. Extensive qualitative and quantitative evaluations, including human preference evaluation, show high instance consistency and editability, and the method outperforms strong baselines on multi-instance learning and synthesis tasks. It supports robust, controllable personalized content creation and multi-object scene reconstruction from limited training data.
Abstract
This paper proposes a method for precisely learning and synthesizing multi-instance semantics from a single image. The difficulty of this problem lies in the limited training data, and it becomes even more challenging when the instances to be learned have similar semantics or appearance. To address this, we propose a penalty-based attention optimization that disentangles similar semantics during the learning stage. Then, during synthesis, we introduce and optimize box control in the attention layers to further mitigate semantic leakage while precisely controlling the output layout. Experimental results demonstrate that our method achieves disentangled, high-quality semantic learning and synthesis, striking a balance between editability and instance consistency. Our method remains robust when dealing with semantically or visually similar instances or rarely seen objects. The code is publicly available at https://github.com/Kareneveve/MIFO
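To make the idea of box control in attention layers more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a cross-attention map with one row of logits per spatial position and one column per text token, and it biases each instance token's logits downward outside that instance's bounding box before the softmax. The function names (`box_mask`, `box_weight`, `box_controlled_attention`), the normalized box convention, the additive bias, and the linear decay of the weight over denoising steps are all assumptions introduced here for illustration; the paper's actual in-box and out-of-box formulation and weight schedule may differ.

```python
import math


def box_mask(height, width, box):
    """Grid of 1.0 inside the box and 0.0 outside.

    box = (x0, y0, x1, y1) in normalized [0, 1) coordinates
    (an assumed convention, tested against pixel centers).
    """
    x0, y0, x1, y1 = box
    return [
        [
            1.0
            if x0 <= (x + 0.5) / width < x1 and y0 <= (y + 0.5) / height < y1
            else 0.0
            for x in range(width)
        ]
        for y in range(height)
    ]


def box_weight(step, total_steps, w_max=10.0):
    """Hypothetical schedule: strongest at step 0, zero at the last step."""
    return w_max * (1.0 - step / max(total_steps - 1, 1))


def box_controlled_attention(logits, token_boxes, height, width, weight):
    """Softmax over tokens with an out-of-box penalty on instance tokens.

    logits: height*width rows (row-major), one logit per text token.
    token_boxes: dict mapping a token index to its box; tokens without
    a box (e.g. background words) are left untouched.
    """
    masks = {t: box_mask(height, width, b) for t, b in token_boxes.items()}
    attention = []
    for p, row in enumerate(logits):
        y, x = divmod(p, width)
        # Subtract `weight` from an instance token's logit outside its box.
        biased = [
            l - weight * (1.0 - masks[t][y][x]) if t in masks else l
            for t, l in enumerate(row)
        ]
        m = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in biased]
        total = sum(exps)
        attention.append([e / total for e in exps])
    return attention


# Toy example: a 2x2 latent grid, three tokens (0 = background word,
# 1 = instance A in the left half, 2 = instance B in the right half).
H, W = 2, 2
logits = [[0.0, 0.0, 0.0] for _ in range(H * W)]
boxes = {1: (0.0, 0.0, 0.5, 1.0), 2: (0.5, 0.0, 1.0, 1.0)}
attn = box_controlled_attention(logits, boxes, H, W, box_weight(0, 50))
```

In this toy setup, left-column positions attend to instance A far more than to instance B, and right-column positions do the opposite, which is the kind of leakage suppression between similar instances that the abstract describes. Decaying the weight toward zero is one plausible way to enforce layout early in sampling while leaving later steps free to refine detail.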
