Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild

Siyoon Jin, Jisu Nam, Jiyoung Kim, Dahyun Chung, Yeong-Seok Kim, Joonhyung Park, Heonjeong Chu, Seungryong Kim

TL;DR

This work tackles exemplar-based semantic image synthesis in complex, real-world scenes by introducing AM-Adapter, a learning-based module that augments self-attention with semantic-informed cross-image matching. It combines a dual-branch architecture (Appearance Net and Structure Net) with a categorical matching cost derived from segmentation maps and a 4D cost aggregation to robustly align exemplar appearance to target structure. A stage-wise training regime, plus an automated exemplar retrieval mechanism, keeps the diffusion backbone fixed while enriching it with precise local appearance transfer, yielding state-of-the-art results in semantic alignment and appearance fidelity across driving and indoor scenes. The approach enables controllable one-to-one transfer and segmentation-guided editing with strong generalization and practical applicability, as evidenced by extensive ablations and quantitative/user studies.

Abstract

Exemplar-based semantic image synthesis generates images aligned with semantic content while preserving the appearance of an exemplar. Conventional structure-guidance models such as ControlNet are limited because they rely solely on text prompts to control appearance and cannot take exemplar images as input. Recent tuning-free approaches address this by transferring local appearance via implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, prior works are often restricted to single-object cases or foreground object appearance transfer, and they struggle with complex scenes involving multiple objects. To overcome this, we propose AM-Adapter (Appearance Matching Adapter) to address exemplar-based semantic image synthesis in-the-wild, enabling multi-object appearance transfer from a single scene-level image. AM-Adapter automatically transfers local appearances from the scene-level input. Alternatively, it provides controllability to map user-defined object details to specific locations in the synthesized images. Our learnable framework enhances cross-image matching within augmented self-attention by integrating semantic information from segmentation maps. To disentangle generation and matching, we adopt stage-wise training: we first train the structure-guidance and generation networks, then train the matching adapter while keeping the others frozen. During inference, we introduce an automated exemplar retrieval method for selecting exemplar image-segmentation pairs efficiently. Despite utilizing minimal learnable parameters, AM-Adapter achieves state-of-the-art performance, excelling in both semantic alignment and local appearance fidelity. Extensive ablations validate our design choices. Code and weights will be released: https://cvlab-kaist.github.io/AM-Adapter/
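
To make the idea of semantics-informed cross-image matching in augmented self-attention more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes flattened token features, a single attention head, and a fixed additive bias (`bias_scale`, a hypothetical parameter) in place of the learned adapter and 4D cost aggregation described in the paper. All function names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def categorical_cost(seg_target, seg_exemplar):
    """Binary matching cost from segmentation labels.

    Entry (i, j) is 1 when target token i and exemplar token j share a
    class label, and 0 otherwise. Shape: (N_target, N_exemplar).
    """
    return (seg_target[:, None] == seg_exemplar[None, :]).astype(np.float64)


def augmented_self_attention(q_y, k_y, v_y, k_x, v_x, seg_y, seg_x,
                             bias_scale=5.0):
    """Target queries attend to both target (Y) and exemplar (X) tokens.

    ASSUMPTION: the categorical cost is added to the cross-image logits as
    a hand-set bias. The paper instead learns this refinement; the bias
    here is only a stand-in for illustration.
    """
    d = q_y.shape[-1]
    self_logits = q_y @ k_y.T / np.sqrt(d)    # (N_y, N_y)
    cross_logits = q_y @ k_x.T / np.sqrt(d)   # (N_y, N_x), i.e. Q^Y (K^X)^T
    cross_logits = cross_logits + bias_scale * categorical_cost(seg_y, seg_x)

    # Concatenate keys/values so one softmax covers both images.
    logits = np.concatenate([self_logits, cross_logits], axis=1)
    attn = softmax(logits, axis=1)
    values = np.concatenate([v_y, v_x], axis=0)
    return attn @ values, attn
```

In this sketch, target tokens whose class is absent from the exemplar receive no bias, so their attention is left unchanged; how the actual adapter handles such cases is determined by its learned weights rather than by a fixed rule like this one.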

Paper Structure

This paper contains 30 sections, 9 equations, 26 figures, 3 tables.

Figures (26)

  • Figure 1: AM-Adapter enables Exemplar-based Semantic Image Synthesis in-the-Wild. (a) Given an exemplar image and a target segmentation, AM-Adapter generates high-quality images that retain the local appearance of the exemplar and the accurate image structure defined by the segmentation map. We demonstrate the versatility of our method in various applications, including (b) controllable one-to-one appearance transfer with user-defined guidance, (c) image-to-image translation and (d) segmentation-based image editing.
  • Figure 2: Controllability of our AM-Adapter: The $M$-to-$N$ setting refers to a multiple-object many-to-many transfer, where $M$ and $N$ denote the number of instances of a specific category in the exemplar and target, respectively. By default, as shown in (a), AM-Adapter automatically matches appearance based on structural similarity. In addition, as shown in (b), it allows user-defined guidance to precisely transfer specific objects (e.g., a red car) to target instances; white-masked objects indicate user-specified matches.
  • Figure 3: Attention Visualization: (a) Exemplar image, (b) target segmentation, and (g) generated image. Green and orange markers in (b) indicate query points. (c) and (d) show the augmented self-attention map $Q_t^Y (K_t^X)^T$ from the green marker, before and after applying AM-Adapter, respectively. (e) and (f) show the augmented self-attention map $Q_t^Y (K_t^X)^T$ from the orange marker, before and after applying AM-Adapter, respectively. The green marker is within 'car' (present in the reference), while the orange marker is within 'tree' (absent in the reference). AM-Adapter refines mismatches in (c) and (e), demonstrating its effectiveness in (d) and (f).
  • Figure 4: Overall Architecture.
  • Figure 5: Architecture of Appearance Matching Adapter.
  • ...and 21 more figures