MIFO: Learning and Synthesizing Multi-Instance from One Image

Kailun Su, Ziqi He, Xi Wang, Yang Zhou

TL;DR

The paper tackles learning and reconstructing multiple semantically similar object instances from a single image. It introduces a two-stage framework: the first stage disentangles instance semantics through reward- and penalty-based attention optimization, and the second stage synthesizes images precisely by applying in-box and out-of-box control in the self-attention (SA) and cross-attention (CA) layers, guided by a dynamic weight schedule. Qualitative and quantitative evaluations, including human preference studies, indicate high instance consistency and editability, and the method outperforms strong baselines on multi-instance learning and synthesis tasks. The method supports controllable personalized content creation and multi-object scene reconstruction from limited training data.

Abstract

This paper proposes a method for precisely learning and synthesizing multi-instance semantics from a single image. The difficulty of this problem lies in the limited training data, and it becomes even more challenging when the instances to be learned have similar semantics or appearance. To address this, we propose a penalty-based attention optimization that disentangles similar semantics during the learning stage. Then, in the synthesis stage, we introduce and optimize box control in the attention layers to further mitigate semantic leakage while precisely controlling the output layout. Experimental results demonstrate that our method achieves disentangled, high-quality semantic learning and synthesis, striking a balance between editability and instance consistency. Our method remains robust when dealing with semantically or visually similar instances or rarely seen objects. The code is publicly available at https://github.com/Kareneveve/MIFO
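
As a rough illustration of the penalty idea described in the abstract, the sketch below scores how well each learnable token's cross-attention map stays inside its own instance mask (a reward term) and how much it spills into the masks of the other instances (a penalty term). This is a minimal sketch under stated assumptions, not the paper's actual loss: the function names, the mean-based terms, and the `penalty_weight` parameter are hypothetical and chosen only for illustration.

```python
from typing import List

Grid = List[List[float]]  # a 2D attention map or binary mask, row-major


def masked_mean(attn: Grid, mask: Grid) -> float:
    """Average attention value over the cells where mask == 1."""
    total = sum(a * m for row_a, row_m in zip(attn, mask)
                for a, m in zip(row_a, row_m))
    area = sum(m for row_m in mask for m in row_m)
    return total / area if area > 0 else 0.0


def reward_penalty_loss(attn_maps: List[Grid], masks: List[Grid],
                        penalty_weight: float = 1.0) -> float:
    """Hypothetical loss: attn_maps[i] is the attention map of token i,
    masks[i] is the binary mask of the instance that token should describe."""
    loss = 0.0
    for i, attn in enumerate(attn_maps):
        # Reward term: token i should attend inside its own instance mask.
        loss += 1.0 - masked_mean(attn, masks[i])
        # Penalty term: token i should not attend inside other instances.
        for j, other_mask in enumerate(masks):
            if j != i:
                loss += penalty_weight * masked_mean(attn, other_mask)
    return loss


# Two instances on a 1x2 grid: instance 0 on the left, instance 1 on the right.
masks = [[[1.0, 0.0]], [[0.0, 1.0]]]
separated = [[[1.0, 0.0]], [[0.0, 1.0]]]  # each token attends to its own instance
entangled = [[[0.5, 0.5]], [[0.5, 0.5]]]  # both tokens attend to both instances

print(reward_penalty_loss(separated, masks))
print(reward_penalty_loss(entangled, masks))
print(reward_penalty_loss(entangled, masks, penalty_weight=0.0))
```

With `penalty_weight=0.0` only the reward term remains, so the entangled case is penalized less; the penalty term is what explicitly pushes the two tokens apart in this toy setting.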

Paper Structure

This paper contains 69 sections, 30 equations, 32 figures, 4 tables, 2 algorithms.

Figures (32)

  • Figure 1: Multi-object semantic learning for visually similar instances. Learning the semantics of multiple similar-looking instances from a single image is challenging because their features are nearly indistinguishable (left). Existing methods, such as Break-a-Scene (BaS) \cite{avrahami2023break}, confuse the two objects ($\langle v_0 \rangle$ and $\langle v_1 \rangle$) in synthesis (Row 1, right). Our method successfully disentangles these subtle features and produces correct synthesis that adheres to the prompts and additional box control (Rows 2 & 3, right).
  • Figure 2: Illustration of reward-/penalty-based attention control. Red and blue circles represent the query vectors in CA, while the stars denote the two tokens to optimize in semantic learning. As the features of $\langle v_0 \rangle$ and $\langle v_1 \rangle$ are highly entangled, optimizing the tokens by considering only the positive samples (as reward-based approaches do) cannot distinguish the two objects. In contrast, our penalty-based solution aims to separate the tokens after semantic learning.
  • Figure 3: Framework of our method. We divide the multi-instance semantic learning problem into two stages: Disentangled Semantic Learning for acquiring semantic and visual representations, and Precise Synthesis with Box Control for controlled reconstruction and synthesis. Joint Sampling is employed in the semantic learning stage (see Appx. \ref{appx_subsec-sampling_strategies}).
  • Figure 4: Illustration of in-/out-of-box control in Self-Attention (SA)/Cross-Attention (CA) layers.
  • Figure 5: Results of semantic learning and precise synthesis with rarely seen objects.
  • ...and 27 more figures
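
The in-/out-of-box control named in Figure 4 and the dynamic weight schedule mentioned in the TL;DR can be pictured with a small sketch. It biases one token's pre-softmax attention scores upward inside a target box and downward outside it, with a weight that decays over the denoising steps. Everything here is an assumption made for illustration: the additive bias, the linear decay, and the names `box_mask`, `control_weight`, `apply_box_control`, and `w_max` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def box_mask(height: int, width: int, box: Box) -> List[List[int]]:
    """Binary mask that is 1 inside the box and 0 elsewhere."""
    r0, c0, r1, c1 = box
    return [[1 if r0 <= r < r1 and c0 <= c < c1 else 0 for c in range(width)]
            for r in range(height)]


def control_weight(step: int, num_steps: int, w_max: float = 5.0) -> float:
    """Hypothetical schedule: strongest at step 0, linearly decaying to 0."""
    return w_max * (1.0 - step / max(num_steps - 1, 1))


def apply_box_control(scores: List[List[float]], box: Box,
                      weight: float) -> List[List[float]]:
    """Raise one token's pre-softmax scores in-box, lower them out-of-box."""
    mask = box_mask(len(scores), len(scores[0]), box)
    return [[s + weight if m else s - weight for s, m in zip(row_s, row_m)]
            for row_s, row_m in zip(scores, mask)]


# A 2x2 score map for one token, with the target box covering the top-left cell.
scores = [[0.0, 0.0], [0.0, 0.0]]
for step in (0, 5, 9):
    w = control_weight(step, num_steps=10)
    print(step, apply_box_control(scores, (0, 0, 1, 1), w))
```

A decaying weight reflects a common design choice in layout-guided diffusion: strong guidance while the coarse layout forms in early steps, weaker guidance later so fine details are not distorted.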